Machine learning models can do amazing things if given enough training data. Unfortunately, access to high-quality data remains a barrier for many applications. Data augmentation, a technique that generates new training examples from existing ones, is one solution to this problem. In data-constrained environments, data augmentation is a low-cost and effective method for improving the performance and accuracy of machine learning models.
Machine learning models trained on a small number of examples tend to overfit. Overfitting occurs when an ML model performs well on its training data but fails to generalize to new data. There are several ways to reduce overfitting in machine learning, including using different algorithms, changing the architecture of the model, and adjusting hyperparameters. However, the main cure for overfitting is to add more high-quality data to the training dataset.
However, obtaining additional training examples can be costly, time-consuming, or even impossible in some cases. This challenge becomes even more acute in supervised learning applications, where training examples must be labeled by human experts.
One way to increase the diversity of a training dataset is to make copies of existing examples and apply minor changes to them. This is known as data augmentation. Suppose we have twenty duck images in our image classification dataset. By making copies of the duck images and flipping them horizontally, we double the training examples for the “duck” class. Other transformations, such as rotation, cropping, zooming, and translation, are available as well, and we can combine them to produce even more one-of-a-kind training examples.
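To make this concrete, here is a minimal sketch of such a pipeline using the torchvision library (the transform parameters and the duck.jpg file name are illustrative assumptions, not fixed choices):

```python
from PIL import Image
from torchvision import transforms

# Chain of random geometric transformations; each call on an image
# produces a slightly different variant of it.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                    # mirror half the time
    transforms.RandomRotation(degrees=15),                     # rotate up to +/-15 degrees
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # crop and zoom
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # shift up to 10%
])

image = Image.open("duck.jpg")  # hypothetical file name
new_examples = [augment(image) for _ in range(5)]  # five new "duck" variants
```

Because every transform in the chain is randomized, running the pipeline repeatedly on the same source image yields a stream of distinct training examples.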
Geometric manipulation is not the only type of data augmentation. Adding noise, changing color settings, and applying effects like blur and sharpening filters to existing training examples can also aid in repurposing them as new data. Data augmentation is especially useful for supervised learning since we already have the labels and don’t have to spend time annotating the new examples. Other classes of machine learning algorithms, like unsupervised learning, contrastive learning, and generative models, benefit from data augmentation as well.
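These photometric augmentations can be chained in the same way. Here is a rough sketch, again with torchvision; the noise level and other parameter values are assumptions for illustration:

```python
import torch
from torchvision import transforms

photometric = transforms.Compose([
    # Randomly perturb color settings.
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    # Apply a randomized blur effect.
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
    # torchvision has no built-in noise transform, so a small Lambda
    # adds Gaussian noise and clamps values back into the [0, 1] range.
    transforms.Lambda(lambda x: (x + 0.05 * torch.randn_like(x)).clamp(0.0, 1.0)),
])
```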
Data augmentation has become standard practice for training machine learning models in computer vision applications. Popular machine learning and deep learning libraries include simple functions for incorporating data augmentation into the ML training pipeline. And data augmentation is not limited to images. Nouns and verbs in text datasets can be replaced with synonyms, and training examples in audio data can be modified by adding noise or changing the playback speed.
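For text, the synonym-replacement idea can be sketched in a few lines of plain Python. The tiny synonym table below is purely illustrative; a real pipeline would draw on a thesaurus such as WordNet:

```python
import random

# Toy synonym table; entries are purely illustrative.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "jumps": ["leaps", "hops"],
    "walks": ["strolls", "paces"],
}

def augment_sentence(sentence: str, replace_prob: float = 0.5) -> str:
    """Return a new sentence with some words swapped for synonyms."""
    out = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and random.random() < replace_prob:
            out.append(random.choice(options))
        else:
            out.append(word)
    return " ".join(out)

print(augment_sentence("the quick fox jumps over the lazy dog"))
# e.g. "the fast fox leaps over the lazy dog"
```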
Data augmentation is not a panacea for all of our data issues. Think of it as a free performance boost, not a replacement for real data. Depending on the target application, we will still need a reasonably large training dataset with sufficient examples. And in some cases, the training data may be too small for data augmentation to be of any help. In those cases, we must collect more data until we reach a minimum threshold before applying data augmentation.
One technique that can help in such situations is transfer learning: we train an ML model on a large, general dataset and then repurpose it by fine-tuning its higher layers on the limited data available for our target application.
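A minimal sketch of this idea with a pretrained torchvision model might look as follows (the two-class setup is a hypothetical example):

```python
import torch.nn as nn
from torchvision import models

# Load a network pretrained on a large, general dataset (ImageNet).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the lower layers so their general-purpose features are kept.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with one sized for our target task (here, a
# hypothetical two-class problem). Only this new layer, and optionally
# other higher layers, is then trained on our limited data.
model.fc = nn.Linear(model.fc.in_features, 2)
```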
Data augmentation also does not address other problems, such as biases in the training dataset. And the augmentation process itself must be tuned to account for other potential issues, such as class imbalance.
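For example, if one class has far fewer examples than the others, we might augment that class more aggressively than the rest. Here is an illustrative sketch of that idea, where augment_fn stands for any augmentation function, such as the image pipeline above:

```python
from collections import Counter

def balance_with_augmentation(examples, labels, augment_fn):
    """Oversample minority classes by augmenting their examples."""
    counts = Counter(labels)
    target = max(counts.values())  # bring every class up to the largest
    new_examples, new_labels = list(examples), list(labels)
    for cls, count in counts.items():
        members = [x for x, y in zip(examples, labels) if y == cls]
        for i in range(target - count):
            # Cycle through the class's examples, augmenting each copy.
            new_examples.append(augment_fn(members[i % len(members)]))
            new_labels.append(cls)
    return new_examples, new_labels
```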