Teaching AI to multi-task
If you can recognize a dog by sight, you can probably also recognize one described in words. Today's artificial intelligence cannot. Deep neural networks have become very good at identifying objects in photos and at conversing in natural language, but not at the same time: there are AI models that excel at one or the other, but not both.
Part of the problem is that these models learn different skills using different techniques. This is a major obstacle to the development of more general-purpose AI: machines that can multitask and adapt. It also means that advances in deep learning for one skill often do not carry over to others.
The Meta AI (formerly Facebook AI Research) team wants to change that. Its researchers have developed a single algorithm that can be used to train a neural network to recognize images, text, or speech. The algorithm, called data2vec, not only unifies the learning process but performs at least as well as existing techniques across all three skills. “I hope people’s way of thinking about this kind of work will change,” says Michael Auli, a researcher at Meta AI.
The work builds on an approach known as self-supervised learning, in which a neural network learns to spot patterns in a dataset by itself, without being guided by labeled examples. This is how large language models like GPT-3 learn from vast amounts of unlabeled text scraped from the internet, and it has driven many of the recent advances in deep learning.
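To make the idea concrete, here is a minimal sketch of self-supervised learning on unlabeled text: a token is hidden and the model is trained to predict it from its context, so the text itself supplies the training signal. (GPT-3 itself predicts the next token rather than a masked one, but the principle is the same.) This is not Meta's or OpenAI's code; the tiny Transformer, vocabulary size, and masking scheme below are placeholders.

```python
# Minimal self-supervised sketch: predict a hidden token from its context.
# The unlabeled text provides its own training signal; no human labels needed.
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 64
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
        num_layers=2,
    ),
    nn.Linear(embed_dim, vocab_size),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (8, 16))  # a batch of unlabeled token ids
masked = tokens.clone()
mask_pos = 5                                    # hide one position per sequence
masked[:, mask_pos] = 0                         # 0 stands in for a [MASK] token (placeholder)

logits = model(masked)                          # (batch, seq_len, vocab_size)
loss = loss_fn(logits[:, mask_pos, :], tokens[:, mask_pos])
loss.backward()
optimizer.step()
```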
Auli and his colleagues at Meta AI had been working on self-supervised learning for speech recognition. But when they looked at what other researchers were doing with self-supervised learning for images and text, they realized they were all using different techniques to chase the same goal.
Data2vec uses two neural networks, a student and a teacher. First, the teacher network is trained on images, text, or speech in the usual way, learning an internal representation of that data that allows it to make predictions when shown new examples. Shown a picture of a dog, it recognizes it as a dog.
The twist is that the student network is trained to predict the teacher's internal representations. In other words, when shown a picture of a dog, it is trained not to guess that it is looking at a dog, but to guess what the teacher sees when it is shown that picture.
Because the student is trying to guess the teacher's thoughts about an image or a sentence, rather than the image or sentence itself, the algorithm does not need to be tailored to a particular type of input.
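A simplified sketch of that student-teacher setup, loosely following the paper's description rather than Meta's released code, might look like the following: the teacher encodes the full input, the student sees a masked copy and regresses onto the teacher's representations at the masked positions, and the teacher's weights track the student's as an exponential moving average. The network sizes, masking ratio, and EMA rate are illustrative placeholders.

```python
# Simplified student/teacher sketch in the spirit of data2vec (not the real implementation).
import copy
import torch
import torch.nn as nn

embed_dim = 64
student = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim))
teacher = copy.deepcopy(student)        # same architecture; updated only via EMA, never by gradients
for p in teacher.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(8, 16, embed_dim)       # a batch of embedded inputs: image patches, audio frames, or tokens
mask = torch.rand(8, 16, 1) < 0.15      # hide roughly 15% of positions from the student
student_in = x.masked_fill(mask, 0.0)

with torch.no_grad():
    targets = teacher(x)                # the teacher sees the *unmasked* input

preds = student(student_in)             # the student only sees the masked copy
loss = nn.functional.mse_loss(          # regress onto the teacher's representations at masked positions
    preds[mask.expand_as(preds)],
    targets[mask.expand_as(targets)],
)
loss.backward()
optimizer.step()

# The teacher follows the student as an exponential moving average of its weights.
tau = 0.999
with torch.no_grad():
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(tau).add_(ps, alpha=1 - tau)
```

Because the loss compares internal representations rather than labels, the same training loop applies whether `x` holds image patches, audio frames, or word embeddings.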
Data2vec is part of a larger trend in AI toward models that can learn to understand the world in more than one way. “It’s a smart idea,” says Ani Kembhavi, who studies vision and language at the Allen Institute for Artificial Intelligence in Seattle. “It’s a promising step forward when it comes to generalized learning systems.”
An important caveat is that although the same learning algorithm can be used for different skills, it can only learn one skill at a time. Once it has learned to recognize images, it has to start from scratch to learn to recognize speech. Giving an AI multiple skills at once is hard, but that is what the Meta AI team wants to look at next.
The researchers were surprised to find that their approach actually outperformed existing techniques at recognizing images and speech, and performed as well as leading language models at text comprehension.
Mark Zuckerberg is already dreaming up potential metaverse applications. All of this will eventually be built into AR glasses with an AI assistant, he posted on Facebook today. It could help you cook dinner, noticing if you are missing an ingredient, prompting you to turn down the heat, or handling more complex tasks.
The key takeaway for Auli is that researchers should step out of their silos. “Hey, you don’t have to focus on one thing,” he says. “Having a good idea can help across the board.”