Many machine-learning tasks, such as assessing damage in satellite imagery, require high-quality data. Even when usable data exists, assembling a dataset can cost millions of dollars, and even the best datasets can contain biases that negatively impact a model’s performance.
Many scientists have been working on an intriguing question: can a model be trained on synthetic data sampled from a generative model rather than on real data? A generative model is itself a machine-learning model, and it requires far less memory to store or share than a dataset. In recent years, the variety and quality of generative models have improved greatly.
Synthetic data can also sidestep some of the privacy and usage-rights issues that limit how real data can be distributed. And to overcome biases in traditional datasets, a generative model could be edited to remove specific attributes, such as race or gender.
The MIT team has developed a method for training a machine-learning (ML) model that, rather than requiring a dataset, uses a generative model to produce exceptionally realistic synthetic data, which in turn is used to train another model for downstream vision tasks.
Their findings imply that a contrastive representation learning model trained solely on synthetic data can produce visual representations that are comparable to, if not superior to, those learned from real-world data.
Once trained on real data, a generative model can produce synthetic data that is nearly indistinguishable from the original. Training involves feeding the generative model millions of photos of objects in a specific class (such as vehicles or cats), after which it learns to generate similar objects.
By simply flipping a switch, this pretrained generative model can be used to generate a steady stream of unique and realistic images based on those in the model’s training dataset.
According to the team, generative models are more beneficial because they learn how to modify the underlying data on which they are trained. A model trained on vehicle photos, for example, can “imagine” how a car would look in new scenarios — situations it hasn’t seen before — and then generate images of the car in various positions, colors, or sizes.
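This "imagining" amounts to sampling a point in the generative model's latent space and perturbing it, so that each perturbation yields the same underlying object under a small variation. The sketch below illustrates the idea with a toy, randomly initialized generator; `toy_generator`, the dimensions, and the perturbation scale are all illustrative assumptions, not the team's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 8   # size of the latent vector z (illustrative)
IMAGE_DIM = 32   # flattened "image" size (illustrative)

# Stand-in for a pretrained generator: a single linear layer + tanh.
W = rng.normal(size=(IMAGE_DIM, LATENT_DIM))

def toy_generator(z):
    """Map a latent vector z to a synthetic flattened 'image'."""
    return np.tanh(W @ z)

def generate_views(n_views=4, sigma=0.1):
    """Sample one latent point, then perturb it to 'imagine' the same
    object under small variations (pose, color, size, ...)."""
    z = rng.normal(size=LATENT_DIM)
    views = [toy_generator(z + sigma * rng.normal(size=LATENT_DIM))
             for _ in range(n_views)]
    return np.stack(views)

views = generate_views()
print(views.shape)  # (4, 32): four nearby views of one synthetic object
```

With a real pretrained generator in place of `toy_generator`, the same latent-perturbation loop yields a stream of realistic, subtly varied images of one object.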
Contrastive learning requires multiple views of the same image: a machine-learning model is exposed to a large number of unlabeled images and trained to determine whether pairs of images are similar or different.
The researchers linked a pretrained generative model to a contrastive learning model so that the two could work together automatically. Because generative models can provide multiple perspectives on the same object, they believe a generative model can help the contrastive method learn more accurate representations.
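Contrastive methods commonly score pairs of views with a loss such as InfoNCE, which pulls together embeddings of views of the same object and pushes apart embeddings of different objects. The following is a generic NumPy sketch of that loss, not the team's specific implementation; the embeddings here are random placeholders standing in for the outputs of an encoder:

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE loss for two batches of embeddings, where z1[i] and z2[i]
    are two views of the same object (a positive pair) and every other
    row serves as a negative."""
    # L2-normalize so the dot product is cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / temperature  # pairwise similarities
    # Row i's positive is column i; all other columns are negatives.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
base = rng.normal(size=(16, 64))                 # 16 objects, 64-dim embeddings
z1 = base + 0.01 * rng.normal(size=base.shape)   # view 1 of each object
z2 = base + 0.01 * rng.normal(size=base.shape)   # view 2 of each object
print(info_nce_loss(z1, z2))  # low loss: matched views are nearly identical
```

In the paper's setting, the two views fed to this kind of loss would come from the generative model, for example two perturbed samples of the same latent point, rather than from two augmentations of a real photo.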
The team discovered that their model outperforms a variety of other image classification models trained on real-world data.
A generative model has the advantage of being able to generate an unlimited number of samples. The researchers therefore investigated how the number of samples affected the model’s performance, and found that, in some cases, generating more unique samples yielded additional gains. These generative models have already been trained and are available in online repositories for anyone to use.
The team points out that, in some cases, these models can reveal their source data, which raises privacy concerns. Furthermore, if they are not properly trained, they can amplify biases in the datasets they were trained on.
The team intends to address these limitations in future work. They also plan to investigate ways of using this technique to generate corner cases that could help improve machine-learning models.
Corner cases usually cannot be learned from real data. For instance, if researchers build a computer vision model for a self-driving car, real data would be unlikely to include images of a dog and its owner sprinting down a highway, so the model would never learn what to do in that situation. In such high-stakes settings, generating corner-case data synthetically could improve the performance of machine-learning models.