The hunt is on for new architectures after years of dominance by the transformer, the workhorse of modern AI.
Transformers are the fundamental building blocks of text-generating models such as Google’s Gemini, Anthropic’s Claude, and OpenAI’s GPT-4o, and they also underpin OpenAI’s video-generating model Sora. But transformers are now running up against technical obstacles, particularly around computation.
Transformers aren’t especially efficient at processing and analyzing vast amounts of data, at least when running on off-the-shelf hardware. And as companies build out and expand their infrastructure to meet transformers’ needs, demand for electricity is rising sharply, perhaps unsustainably.
Test-time training (TTT) is a promising architecture developed over the course of a year and a half by researchers at Stanford, UC San Diego, UC Berkeley, and Meta, and proposed this month. According to the research team, TTT models can analyze far more data than transformers without consuming nearly as much computing power.
The hidden state in transformers
Transformers rely heavily on the “hidden state,” which is effectively a long list of data. As a transformer processes something, it adds entries to the hidden state to “remember” what it has just processed. If the model is working its way through a book, for example, the hidden state will include representations of the words it has read.
If you think of a transformer as an intelligent entity, then the lookup table, its hidden state, is the transformer’s brain, according to Yu Sun, a postdoc at Stanford and co-author of the TTT study. That specialized brain enables transformers’ well-known capabilities, such as in-context learning.
The hidden state is part of what makes transformers so powerful, but it also limits them. To “say” even a single word about a book it has just finished reading, a transformer would have to scan through its entire lookup table, a task as computationally demanding as reading the whole book over again.
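As a rough illustration of that growing-lookup-table problem, here is a toy sketch in Python. The class name, shapes, and attention math are simplified stand-ins invented for this article; real transformers store keys and values produced by learned projections and attend with many heads, but the scaling behavior is the same: one stored entry per token, and every output has to revisit all of them.

```python
# Toy stand-in for a transformer-style hidden state: a table that grows
# by one entry per token. Purely illustrative, not any model's real code.
import numpy as np

class ToyLookupTableState:
    def __init__(self, dim: int):
        self.dim = dim
        self.keys = []    # one entry appended per processed token
        self.values = []

    def process_token(self, token_vec: np.ndarray) -> np.ndarray:
        # "Remember" the token by appending it to the table.
        self.keys.append(token_vec)
        self.values.append(token_vec)
        # Producing output means attending over *every* stored entry,
        # so per-token cost grows with everything read so far.
        K = np.stack(self.keys)                      # shape (t, dim)
        scores = K @ token_vec / np.sqrt(self.dim)   # shape (t,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ np.stack(self.values)       # shape (dim,)

state = ToyLookupTableState(dim=64)
for _ in range(1000):                    # 1,000 "tokens" of a book
    state.process_token(np.random.randn(64))
print(len(state.keys))                   # the table now holds 1,000 entries
```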
So Sun and his colleagues had the idea of replacing the hidden state with a machine learning model of its own: think of it as nesting dolls of AI, a model within a model.
The basic idea is a little technical, but the gist is that, unlike a transformer’s lookup table, the TTT model’s internal machine learning model doesn’t keep growing as it processes more data. Instead, it encodes the data it processes into representative variables called weights, which is what makes TTT models so performant. No matter how much data a TTT model processes, the size of its internal model stays the same.
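By contrast, here is a minimal sketch of that idea, under the assumption that the inner model is a single linear map updated with one gradient step per token on a simple reconstruction loss. The class, learning rate, and loss are illustrative simplifications, not the paper’s exact layers or update rule; the point is that every token is folded into the same fixed-size weight matrix rather than appended to a growing table.

```python
# Minimal sketch of the test-time-training idea: the hidden state is itself
# a tiny model (here, one linear map W) whose weights are nudged by a single
# gradient step per token. Loss and update rule are simplified assumptions.
import numpy as np

class ToyTTTState:
    def __init__(self, dim: int, lr: float = 0.01):
        self.W = np.zeros((dim, dim))    # the "brain": a weight matrix that never grows
        self.lr = lr

    def process_token(self, token_vec: np.ndarray) -> np.ndarray:
        # Self-supervised step: nudge W toward reconstructing the incoming token,
        # i.e., one gradient step on 0.5 * ||W x - x||^2 with respect to W.
        error = self.W @ token_vec - token_vec
        self.W -= self.lr * np.outer(error, token_vec)
        # Output comes from the updated inner model, not from a lookup.
        return self.W @ token_vec

state = ToyTTTState(dim=64)
for _ in range(1000):                    # the same 1,000 "tokens"
    state.process_token(np.random.randn(64))
print(state.W.shape)                     # still (64, 64): constant size
```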
Future TTT models, Sun believes, could efficiently process billions of pieces of data, from texts and photos to audio recordings and videos. That is far beyond what today’s models can manage.
According to Sun, their system can say X words about a book without the computational overhead of reading the book X times. Large transformer-based video models such as Sora can only process about 10 seconds of footage, because their “brain” is just a lookup table. The team’s eventual goal, he says, is a system that can process a long video resembling the visual experience of a human life.
Skepticism about TTT models
So will TTT models eventually supersede transformers? They could. But it’s too soon to say for certain.
TTT models aren’t a drop-in replacement for transformers. And because the researchers built only two small models for their study, it’s hard for now to compare TTT as a technique against some of the larger transformer implementations out there.
Mike Cook, a senior lecturer in the informatics department at King’s College London who was not involved in the TTT research, said he found the innovation interesting, and that if the data bears out the efficiency claims, it would be great news. But he could not say whether the new architecture is better than existing ones. When he was an undergraduate, he recalled, an old professor of his used to quip: how do you solve any problem in computer science? Add another layer of abstraction. A neural network inside a neural network definitely brings that to mind, he said.
Regardless, the increased pace of research into transformer replacements indicates a growing awareness of the need for a breakthrough.
Mistral, an AI firm, unveiled a model this week called Codestral Mamba, which is based on another transformer alternative known as state space models (SSMs). SSMs, like TTT models, appear to be more computationally efficient than transformers and can handle larger datasets.
AI21 Labs is also investigating SSMs. So is Cartesia, which pioneered some of the early SSMs, as well as Mamba and Mamba-2, the namesakes of Codestral Mamba.
If these attempts succeed, generative AI may become even more accessible and widespread than it is today – for better or worse.