
Guide to Robust Image Classification with a Small Data Set



One of the biggest myths about AI is that you need a large amount of data to obtain sufficient accuracy, and the rapid development of Big Data analytics seems to support this intuition. It is true that deep learning methods typically require model training on a huge number of labeled images. However, in image classification even a small collection of training images may produce a high accuracy rate (90–100%) when using newer machine learning techniques that either make use of data previously collected in adjacent domains or replace direct classification entirely with measuring similarity between images.

Knowledge Cross-Utilization

Similar to the human capability to apply knowledge obtained in one sphere to related spheres, machine learning and deep learning algorithms can utilize the knowledge acquired for one task to solve adjacent problems.

Even though ML/DL algorithms are traditionally designed to work in isolation on specific tasks, the methods of transfer learning and domain adaptation aim to overcome this isolated learning paradigm and develop models that learn in a way closer to humans.

Transfer learning

Transfer learning is the method that generalizes knowledge, including features and weights, from previously learned tasks and applies them to newer, related ones that lack data. In computer vision, for instance, certain low-level features, such as edges, shapes, corners and intensity, can be shared across multiple tasks.

To understand how it works, we can use the framework presented in the paper A Survey on Transfer Learning (Pan & Yang 2010), which defines transfer learning in terms of domains, tasks, and marginal probabilities:

A domain D consists of two components: a feature space X and a marginal probability distribution P(x), where x ∈ X. As a rule, if two domains are different, they may have different feature spaces or different marginal probability distributions.

Similarly, a task T = {Y, f(·)} consists of two components: a label space Y and a predictive function f(·), i.e., a mapping from the feature space to the label space. From a probabilistic viewpoint, f(x) can also be written as the conditional distribution P(y|x). Based on these representations, transfer learning can be defined as follows: given a source domain Ds and learning task Ts, and a target domain Dt and learning task Tt, transfer learning aims to improve the learning of the target predictive function ft(·) in Dt using the knowledge in Ds and Ts, where Ds ≠ Dt or Ts ≠ Tt (Pan & Yang 2010). In most cases, it is assumed that only a limited number of labeled target examples is available, exponentially smaller than the number of labeled source examples.

To explain how transfer learning can be used in real life, let's look at one particular application: learning from simulation. Simulation is often the preferred tool for gathering data and training a model, rather than collecting data in the real world. When learning from a simulation and applying the acquired knowledge to the real world, the model uses the same feature space for the source and target domains (both generally rely on pixels). However, the marginal probability distributions of simulation and reality are different, so objects in the simulation and in the real world look different, although this difference diminishes as simulations become more realistic.
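To make this concrete, below is a minimal sketch of transfer learning in code, assuming PyTorch and torchvision (no framework is prescribed here): an ImageNet-pretrained ResNet-18 supplies the source knowledge, the backbone is frozen, and only a new classification head is trained on the small target dataset. The class count is an illustrative placeholder.

```python
# Minimal transfer-learning sketch (assumes PyTorch + torchvision).
import torch
import torch.nn as nn
from torchvision import models

NUM_TARGET_CLASSES = 5  # hypothetical size of the small target label space

# Source knowledge: a ResNet-18 pretrained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the backbone: low-level features (edges, corners, shapes)
# transfer across tasks, so they are kept fixed.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a new head for the target label space.
model.fc = nn.Linear(model.fc.in_features, NUM_TARGET_CLASSES)

# Only the new head's parameters are optimized on the target data.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One gradient step on a batch of labeled target examples."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the small head is trained, relatively few labeled target examples can be enough to reach a usable accuracy.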

Further reading

Pan, Sinno Jialin & Yang, Qiang (2010). A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359. [PDF]

Domain adaptation

Usually considered a subfield of transfer learning, domain adaptation refers to methods for fighting the so-called domain shift challenge: the distribution of data in the target domain differs from that in the source domain, i.e., the marginal probabilities differ, P(Xs) ≠ P(Xt), so we need to devise models that can cope with this shift.

To achieve successful unsupervised domain adaptation, we need to cover three main aspects:

  • domain-agnostic feature extraction: the distributions of features extracted from both domains should be indistinguishable as judged by an adversarial discriminator network;
  • domain-specific reconstruction: embeddings should be decoded back to the source and target domains;
  • cycle consistency: to ensure that the mappings are learned correctly, we should be able to get back where we started.

The simplest approach to unsupervised domain adaptation is building a network that extracts features that remain the same across the domains by making them indistinguishable to a separate part of the network, a discriminator. At the same time, these features should stay representative of the source domain so the network is still able to classify objects. As the approach is unsupervised, we don't need any labels for the target domain, only for the source domain, which in many cases is synthetic data.
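As a compact illustration of this adversarial setup, the sketch below follows the gradient-reversal trick of Ganin & Lempitsky (2015), again assuming PyTorch; all layer sizes and the ten-class source label space are illustrative assumptions.

```python
# Sketch of adversarial feature alignment via gradient reversal
# (in the spirit of DANN; all sizes are illustrative).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negated gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

features = nn.Sequential(         # shared, domain-agnostic feature extractor
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 64), nn.ReLU())
classifier = nn.Linear(64, 10)    # label predictor, source labels only
discriminator = nn.Linear(64, 2)  # predicts: source domain or target domain?

def total_loss(x_src, y_src, x_tgt):
    ce = nn.CrossEntropyLoss()
    f_src, f_tgt = features(x_src), features(x_tgt)
    # Features must stay discriminative on the labeled source domain.
    cls_loss = ce(classifier(f_src), y_src)
    # The reversed gradient pushes the extractor to make source and
    # target features indistinguishable to the discriminator.
    f_all = GradReverse.apply(torch.cat([f_src, f_tgt]))
    d_labels = torch.cat([torch.zeros(len(x_src), dtype=torch.long),
                          torch.ones(len(x_tgt), dtype=torch.long)])
    dom_loss = ce(discriminator(f_all), d_labels)
    return cls_loss + dom_loss
```

Note that no target labels appear anywhere: the target domain contributes only unlabeled images to the domain loss.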

Alternatively, domain adaptation can map the source data distribution to the target distribution. Both domains X and Y could be mapped into a shared domain Z where the distributions are aligned. This embedding must be domain-agnostic, hence we want to maximize the similarity between the distributions of embedded source and target images.

Further reading

Murez, Zak & Kolouri, Soheil & Kriegman, David & Ramamoorthi, Ravi & Kim, Kyungnam. (2017). Image to Image Translation for Domain Adaptation. [PDF]

Pinheiro, Pedro H. O. (2018). Unsupervised Domain Adaptation with Similarity Learning. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018), 8004–8013. [PDF]

Similarity-based approaches

An alternative to directly classifying an input image into one of the output classes is to measure the similarity between images by learning a similarity function.

Few-shot learning

Few-shot learning is an object categorization problem studied mostly in computer vision. In contrast to conventional ML algorithms, few-shot learning aims to learn information about object categories from just a few training images, or even a single one in so-called one-shot learning. In addition to the input image, the model also takes a reference image of a specific object as input and produces a similarity score denoting the chance that the two input images belong to the same object.

In its simplest form, a one-shot learning method computes a distance-weighted combination of support set labels. The distance metric can be defined using a Siamese network: two identical CNNs that share the same weights and accept two different images. The outputs of the two networks are then fed to a contrastive loss function, which calculates the similarity between the two images.

The first network outputs the encoding/vector of the image being queried, and the second network, correspondingly, outputs the encoding/vector of the reference image from the dataset. Afterwards, the two encodings are compared to check whether the images are similar. The networks are optimized based on the loss between their outputs, using either the triplet loss or the contrastive loss function.
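As an illustration, such a weight-shared encoder could be sketched as follows (assuming PyTorch; the CNN architecture, embedding size, and 28×28 grayscale inputs are arbitrary assumptions):

```python
# Minimal Siamese network sketch: one weight-shared encoder, two inputs.
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # A small CNN encoder Gw shared by both branches.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.LazyLinear(embed_dim))

    def forward(self, x1, x2):
        # The same weights encode both the query and the reference image.
        g1, g2 = self.encoder(x1), self.encoder(x2)
        # Euclidean distance Dw between the two embeddings:
        # a small distance suggests the same object.
        return torch.norm(g1 - g2, p=2, dim=1)

net = SiameseNet()
query = torch.randn(4, 1, 28, 28)      # hypothetical query batch
reference = torch.randn(4, 1, 28, 28)  # hypothetical reference batch
distances = net(query, reference)
```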

The triplet loss function is used to calculate gradients and is represented as follows:

L(a, p, n) = max(d(a, p) − d(a, n) + margin, 0)

where a represents the anchor image (the reference image from the data set), p represents a positive image and n represents a negative image. We know that the dissimilarity between a and p should be less than the dissimilarity between a and n. Another variable called margin is added as a hyperparameter to define how far apart the dissimilarities should be, i.e., if margin = 0.2 and d(a, p) = 0.5, then d(a, n) should be at least 0.7.
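In code, this is a short function (a sketch assuming PyTorch, operating on batches of embedding vectors):

```python
# Triplet loss sketch: pull the anchor toward the positive embedding,
# push it away from the negative one by at least `margin`.
import torch

def triplet_loss(a, p, n, margin=0.2):
    """a, p, n: batches of anchor, positive, and negative embeddings."""
    d_ap = torch.norm(a - p, p=2, dim=1)  # d(a, p)
    d_an = torch.norm(a - n, p=2, dim=1)  # d(a, n)
    # The loss is zero once d(a, n) >= d(a, p) + margin.
    return torch.clamp(d_ap - d_an + margin, min=0).mean()
```

PyTorch also ships an equivalent built-in, torch.nn.TripletMarginLoss.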

The contrastive loss function is given as follows:

L(W, Y, X1, X2) = (1 − Y) · ½ · (Dw)² + Y · ½ · (max(0, m − Dw))²

where Y = 0 for a similar pair of images, Y = 1 for a dissimilar pair, m is a margin, and Dw is the Euclidean distance between the outputs of the sister Siamese networks. Mathematically, the Euclidean distance is represented as follows:

Dw = ‖Gw(X1) − Gw(X2)‖₂

where Gw is the output of one of the sister networks, and X1 and X2 are the input data pair.

The loss functions calculate the gradients that are used to update the weights and biases of the Siamese network. The loss is small when the images are similar and large when they are not.
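The contrastive formula above translates just as directly (a sketch assuming PyTorch, with Y = 0 for similar pairs and Y = 1 for dissimilar pairs, as in the formulation above):

```python
# Contrastive loss sketch (Y = 0: similar pair, Y = 1: dissimilar pair).
import torch

def contrastive_loss(d_w, y, margin=1.0):
    """d_w: Euclidean distances between paired embeddings; y: pair labels."""
    similar_term = (1 - y) * 0.5 * d_w.pow(2)
    # Dissimilar pairs are penalized only while closer than the margin.
    dissimilar_term = y * 0.5 * torch.clamp(margin - d_w, min=0).pow(2)
    return (similar_term + dissimilar_term).mean()
```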

A development of this approach can be seen in the method by Santoro et al. (2016) using a Memory-Augmented Neural Network (MANN). In their model, a neural network is extended with an external memory module, so that the model is differentiable and can be trained end-to-end. Their training procedure forces the network to learn general knowledge, while quick memory access allows it to rapidly bind this general knowledge to new data.
