The Swiss Army knife has remained popular for over a century thanks to the convenience of a single tool with multi-purpose capabilities. In the new paper A Generalist Agent, a DeepMind research team introduces an AI take on this concept: Gato, a single general-purpose agent that can perform more than 600 varied tasks, from captioning images to stacking blocks with a real robot arm to navigating simulated 3D environments, all with the same network and the same weights. Gato, a novel transformer sequence model, even outperforms human players on a number of Atari games.
The DeepMind researchers begin with the hypothesis that it is possible to train an agent that is generally capable of performing a large number of tasks, and that this general agent can be adapted with little extra data to perform even more. They point out that a general-purpose agent offers significant benefits: it eliminates the need to hand-craft policy models for each domain, increases the amount and diversity of usable training data, and keeps improving as data, compute, and model scale grow.
Gato is intended to be trained on as wide a range of relevant data as possible. It functions as a multi-modal, multi-task, multi-embodiment generalist policy by serializing all of this data, whatever its modality, into a flat sequence of tokens that a single transformer processes. As a result, it can adapt to and succeed at tasks with varying modalities, observations, and action specifications, and it can handle new tasks given minimal additional data.
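To make the serialization idea concrete, the sketch below shows one plausible way such flattening could work. It is a minimal illustration under assumed conventions, not DeepMind's actual tokenizer: the vocabulary size, patch size, bin count, separator id, and the helper names (tokenize_text, tokenize_image, tokenize_actions) are all placeholders for exposition.

```python
import numpy as np

# A minimal sketch of multi-modal serialization, assuming:
# - text is tokenized with a subword-style vocabulary (stubbed here),
# - images are split into fixed-size patches, each reduced to one token,
# - continuous actions in [-1, 1] are discretized into uniform bins.
# None of these exact choices are taken from the Gato paper.

TEXT_VOCAB = 32_000   # hypothetical subword vocabulary size
NUM_BINS = 1024       # hypothetical bins for continuous values
PATCH = 16            # hypothetical image patch size
SEP = -1              # hypothetical separator token id

def tokenize_text(text):
    # Stand-in for a real subword tokenizer.
    return [hash(w) % TEXT_VOCAB for w in text.split()]

def tokenize_image(img):
    # Split an HxWxC image into non-overlapping PATCHxPATCH patches;
    # here each patch is crudely summarized by its mean-intensity bin.
    h, w = img.shape[:2]
    patches = [
        img[i:i + PATCH, j:j + PATCH]
        for i in range(0, h, PATCH)
        for j in range(0, w, PATCH)
    ]
    return [int(p.mean() / 256 * NUM_BINS) for p in patches]

def tokenize_actions(actions):
    # Discretize continuous actions in [-1, 1] into NUM_BINS bins.
    return [int((a + 1) / 2 * (NUM_BINS - 1)) for a in actions]

# One timestep of an episode becomes one flat segment:
# [instruction tokens..., observation tokens..., separator, action tokens...]
obs_img = np.random.randint(0, 256, (64, 64, 3))
instruction = "stack the red block on the blue block"
action = [0.25, -0.8, 0.1]

sequence = (
    tokenize_text(instruction)
    + tokenize_image(obs_img)
    + [SEP]
    + tokenize_actions(action)
)
print(len(sequence), "tokens in the flattened segment")
```

In the full model, each token in such a sequence would be embedded and fed to one transformer trained to predict the next token, regardless of which modality the token came from.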
Gato was trained on a large number of datasets that included agent experience in both simulated and real-world settings. For vision and language training, the researchers used MassiveText, a large text dataset comprising web pages, books, news articles, and code, along with vision-language datasets such as ALIGN (Jia et al., 2021) and MS-COCO Captions (Chen et al., 2015).
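Training a single set of weights on such a heterogeneous collection typically means interleaving batches drawn from the different datasets according to sampling weights. The sketch below illustrates that pattern; the dataset names and weights are illustrative placeholders, not the mixture reported in the paper.

```python
import random

# Hypothetical training mixture: names and proportions are
# placeholders, not the actual mixture used for Gato.
MIXTURE = {
    "massive_text": 0.4,      # web pages, books, news, code
    "vision_language": 0.2,   # e.g. ALIGN-style image-text pairs
    "control_sim": 0.3,       # simulated agent experience
    "control_robot": 0.1,     # real robot episodes
}

def sample_dataset(mixture):
    # Draw one dataset name with probability proportional to its weight.
    names, weights = zip(*mixture.items())
    return random.choices(names, weights=weights, k=1)[0]

# Each training step pulls its batch from whichever dataset was
# sampled, so successive gradient updates can come from different
# tasks and modalities while the network weights stay shared.
for step in range(5):
    print(f"step {step}: batch from {sample_dataset(MIXTURE)}")
```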
Gato was evaluated on a broad suite of tasks spanning simulated control, robotic stacking, and ALE Atari games. In the experiments, it exceeded the 50% expert score threshold on 450 of the 604 tasks.
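The "50% expert score" criterion is a per-task normalization: the agent's mean return is compared with the return an expert achieves on the same task. The snippet below illustrates one simple reading of that computation; the scores are made-up numbers, and the paper's exact normalization may differ, so treat this purely as an illustration of the threshold.

```python
# Illustrative per-task normalization against an expert baseline.
# Task names and scores are invented, not results from the paper.
tasks = {
    "atari_breakout":    {"agent": 85.0, "expert": 100.0},
    "dm_control_walker": {"agent": 40.0, "expert": 100.0},
    "robot_stacking":    {"agent": 62.0, "expert": 100.0},
}

def expert_pct(agent_score, expert_score):
    # Percentage of the expert's score achieved by the agent.
    return 100.0 * agent_score / expert_score

passed = [
    name for name, s in tasks.items()
    if expert_pct(s["agent"], s["expert"]) >= 50.0
]
print(f"{len(passed)} of {len(tasks)} tasks above the 50% threshold")
```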
Overall, this work demonstrates the utility of Gato-like transformer sequence models as multi-task, multi-embodiment policies for real-world text, vision, and robotics tasks, as well as their capacity for few-shot, out-of-distribution task learning. The researchers hope that such models will one day serve as default starting points for learning new behaviors, rather than requiring training from scratch.
While the proposed Gato has reignited the AGI debate on Twitter, it is worth noting that the term "AGI" appears in neither the DeepMind paper nor the accompanying blog post, both of which use the less ambitious descriptor "general-purpose agent."