AI is transforming how we interact with technology and the internet, and it has already driven advances in robotics and drug development. The trouble is that we don’t fully understand how or why it works so well. We have a rough sense, but the details are too tangled to pick apart. That’s a problem, because we could end up deploying an AI system in a highly sensitive field like medicine without realizing it has serious flaws baked into how it works.
Recently, a team at Google DeepMind that studies mechanistic interpretability has been developing new methods to let us see inside. At the end of July it released Gemma Scope, a tool that helps researchers understand what is happening when an AI model generates an output. The hope is that a better grasp of a model’s inner workings will let us control its outputs more effectively, leading to better AI systems in the future.
“I want to be able to look inside a model and see if it’s being deceptive,” says Neel Nanda, who leads Google DeepMind’s mechanistic interpretability team. “It should be beneficial to be able to read a model’s mind.”
Mechanistic interpretability, also known as “mech interp,” is a relatively new field of study that seeks to understand how neural networks actually work. At the moment, very roughly, we feed a model a large amount of data and get a set of model weights at the end of training. These are the parameters that determine how the model makes decisions. We have some idea of what happens between the inputs and those weights: in essence, the AI is finding patterns in the data and drawing conclusions from them, but the patterns can be incredibly complex and often very hard for humans to interpret.
It’s like a teacher reviewing a student’s solution to a challenging math problem on an exam. The student (in this analogy, the AI) wrote down the right answer, but the work looks like a bunch of squiggly lines. This assumes the AI always gets the right answer, which isn’t necessarily the case; the AI student may have latched onto a pattern that isn’t meaningful and taken it for granted. For instance, some AI systems in use today will tell you that 9.11 is bigger than 9.8. The various techniques developed in the field of mechanistic interpretability are, in essence, starting to make sense of those squiggly lines and to offer some insight into what might be going on.
“Trying to reverse-engineer the algorithms inside these systems is a key goal of mechanistic interpretability,” Nanda explains. “We give the model a prompt, like ‘Write a poem,’ and then it writes some rhyming lines. What algorithm did it use to accomplish this? We would really like to comprehend it.”
Gemma Scope is built from sparse autoencoders, which work a bit like a microscope: they take a layer’s dense internal activations and expand them into a larger set of sparser features that are easier to interpret. The tricky part is choosing how granular you want the sparse autoencoder to be. Think of the microscope again: you can magnify something so much that it becomes impossible for a human to tell what you’re looking at, but if you zoom out too far, you may limit how much interesting detail you can see and learn from.
DeepMind’s approach was to run sparse autoencoders of different sizes, varying the number of features they wanted each autoencoder to find. The goal was not for DeepMind’s researchers to carry out a full analysis on their own. Gemma and the autoencoders are open-source, so the project aimed to encourage other researchers to examine what the sparse autoencoders found and, with luck, gain fresh insight into the model’s internal logic. And because DeepMind ran autoencoders on every layer of its model, a researcher can now map how an input progresses to an output at a level of detail that wasn’t previously possible.
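To make the idea concrete, here is a minimal sketch of a sparse autoencoder in PyTorch. It is not Gemma Scope’s actual implementation; the dimensions, learning rate, and sparsity penalty are assumptions for illustration. The encoder expands an activation vector into a much larger set of features, a sparsity penalty keeps most of them near zero, and the decoder tries to reconstruct the original activation. The feature count is the “zoom level” dial described above.

```python
# Minimal sparse autoencoder sketch (illustrative only; not Gemma Scope's code).
# It expands a layer's activation vector into many mostly-zero "features" and
# learns to reconstruct the original activation from them.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, act_dim: int, num_features: int):
        super().__init__()
        self.encoder = nn.Linear(act_dim, num_features)  # activation -> feature space
        self.decoder = nn.Linear(num_features, act_dim)  # feature space -> activation

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # most entries end up near zero
        reconstruction = self.decoder(features)
        return features, reconstruction

# The feature count controls the "zoom level": more features, finer-grained concepts.
sae = SparseAutoencoder(act_dim=2304, num_features=16384)  # sizes invented for illustration
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

def training_step(activations: torch.Tensor, l1_coeff: float = 1e-3) -> float:
    features, reconstruction = sae(activations)
    # Reconstruction loss keeps the features faithful; the L1 term keeps them sparse.
    loss = ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```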
Josh Batson, a researcher at Anthropic, thinks this is genuinely exciting for interpretability researchers. With the model open-sourced for anyone to study, a lot of interpretability work can now be done on top of those sparse autoencoders, and it becomes easier for people to learn to use these techniques.
In July, DeepMind and Neuronpedia, a platform for mechanistic interpretability, collaborated to create a Gemma Scope demo that you can try out now. In the demo, you can test different prompts and see how the model breaks them up and which activations light up. You can also play around with the model. If you crank the feature about dogs way up, for instance, and then ask the model a question about US presidents, Gemma will either find a way to weave in a random dog-related ramble or simply start barking at you.
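Under the hood, “cranking a feature up” can look roughly like the sketch below, which assumes a PyTorch-style model and the sparse autoencoder sketched earlier: take the feature’s decoder direction and add a scaled copy of it back into the model’s activations at some layer. The model object, layer index, and feature index are all hypothetical placeholders.

```python
# Hypothetical steering sketch: boost one learned feature by adding its decoder
# direction back into the model's activations at a chosen layer.
import torch

def make_steering_hook(sae, feature_idx: int, strength: float):
    direction = sae.decoder.weight[:, feature_idx].detach()  # feature's direction in activation space

    def hook(module, inputs, output):
        # Returning a value from a PyTorch forward hook replaces the layer's output.
        return output + strength * direction
    return hook

# Assumed usage (model, layer index, and dog_feature_idx are placeholders):
# handle = model.layers[20].register_forward_hook(make_steering_hook(sae, dog_feature_idx, 10.0))
# ...generate text; with the feature cranked up, outputs drift toward dogs...
# handle.remove()
```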
One intriguing aspect of sparse autoencoders is that they are unsupervised, meaning they find features on their own. That leads to surprising discoveries about how the models break down human concepts. “The cringe feature is my personal favorite,” says Joseph Bloom, research lead at Neuronpedia. It seems to show up in negative critical assessments of films and books, he says, and it’s a great example of tracking something that is, in some sense, so human.
When you search for concepts on Neuronpedia, it will show you which features are activated on particular tokens (roughly, words) and how strongly. If you read the text, the parts highlighted in green are where the model thinks the cringe concept is most relevant.
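In code, that per-token view boils down to something like the sketch below, which reuses the sparse autoencoder sketched earlier: run each token’s activation through the encoder and read off one feature’s strength per token. The tokens, activations, and feature index here are invented stand-ins, not real Gemma Scope data.

```python
# Sketch of the per-token view: how strongly does one feature fire on each token?
# Tokens, activations, and the feature index are placeholders for real model outputs.
import torch

tokens = ["That", "scene", "was", "so", "awkward", "."]   # assumed tokenization
layer_activations = torch.randn(len(tokens), 2304)        # stand-in for real activations
cringe_feature_idx = 42                                   # invented index for illustration

features, _ = sae(layer_activations)                      # sae is the sketch defined above
for token, value in zip(tokens, features[:, cringe_feature_idx].tolist()):
    print(f"{token:>10}  feature strength = {value:.2f}")  # higher ~ brighter green highlight
```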
Some features are proving easier to track than others. “Deception is one of the most important features that you would want to find for a model,” says Johnny Lin, the founder of Neuronpedia. But it’s not as simple as spotting the one feature that fires when the model is lying to us; from what Lin has seen, it hasn’t been possible to pinpoint deception and ban it.
DeepMind’s research is similar to work that Anthropic, another AI company, did back in May with Golden Gate Claude. Using sparse autoencoders, Anthropic found the parts of its model, Claude, that lit up when the Golden Gate Bridge in San Francisco was being discussed. It then amplified those bridge-related activations to the point that Claude responded to prompts not as Claude, an AI model, but as the physical Golden Gate Bridge, and genuinely seemed to identify as the bridge.
Quirky as it may seem, mechanistic interpretability research could prove extremely useful. These features are very helpful as a tool for understanding how the model generalizes and what level of abstraction it is working at, Batson says.
For instance, a team led by Samuel Marks, who is now at Anthropic, used sparse autoencoders to find features showing that a particular model was associating certain professions with a specific gender. They then switched off these gender features to reduce bias in the model. Because the experiment was done on a very small model, it’s uncertain whether the findings will hold for a much larger one.
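A hedged sketch of what “switching off” features can look like, again reusing the autoencoder sketched earlier: encode an activation into features, zero out the unwanted ones, and decode back. The indices are invented; this illustrates the general idea, not the team’s actual code.

```python
# Sketch of feature ablation: zero out chosen features and reconstruct the
# activation without them. Indices and tensors here are placeholders.
import torch

def ablate_features(sae, activations: torch.Tensor, feature_indices: list[int]) -> torch.Tensor:
    features = torch.relu(sae.encoder(activations))  # same encoding step as in the sketch above
    features[:, feature_indices] = 0.0               # switch the unwanted features off
    return sae.decoder(features)                     # activation with those features removed

# Assumed usage, with invented indices for the gender-linked features:
# patched = ablate_features(sae, layer_activations, feature_indices=[103, 2048])
```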
Mechanistic interpretability research can also help explain why AI makes mistakes. In the case of the claim that 9.11 is larger than 9.8, researchers at Transluce saw that the question was activating the parts of an AI model related to Bible verses and September 11. They concluded that the AI might be reading the numbers as dates, with 9/11 coming after 9/8 and therefore counting as “later.” In many texts, including religious writings, section 9.11 also comes after section 9.8, which may be why the AI sees it as greater. Once the researchers understood why the AI made this mistake, they tuned down its activations on Bible verses and September 11, and when they asked again whether 9.11 is larger than 9.8, the model answered correctly.
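The two readings are easy to see in a toy comparison: as decimal numbers, 9.11 is smaller than 9.8, but treated as chapter-and-verse or month-and-day pairs, 9.11 comes after 9.8.

```python
# As plain numbers, 9.11 < 9.8; as "section" or date-style pairs, 9.11 comes later.
print(9.11 > 9.8)        # False: numerically, 9.11 is smaller than 9.80
print((9, 11) > (9, 8))  # True: compared piecewise, like chapter.verse or 9/11 vs 9/8
```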
Other possible applications exist as well. LLMs currently ship with a built-in system-level prompt to handle scenarios such as people asking how to build a bomb. When you ask ChatGPT a question, OpenAI first covertly prompts the model not to tell you how to build bombs or do other nefarious things. But with clever prompts, users can often circumvent such limitations and jailbreak AI models.
If the models’ designers could locate where the bomb-building knowledge lives inside an AI, they could, in theory, switch those nodes off permanently. Then even the most ingenious prompt would not elicit an answer about how to build a bomb, because the AI would literally have no knowledge of how to do it left in its system.
That kind of granularity and precise control is easy to imagine but very hard to achieve at the current level of mechanistic interpretability.
One drawback is that steering, the practice of tweaking a model’s features to influence its output, simply doesn’t work that well yet. Steer a model to make it less violent, for instance, and it ends up losing all of its knowledge of martial arts. Steering needs a lot of improvement, according to Lin. Knowledge of how to make a bomb, for example, isn’t just a simple on/off switch in an AI model. It is almost certainly woven into several parts of the model, and turning it off would most likely impair the AI’s understanding of chemistry. Any tinkering could have benefits, but there would also be serious trade-offs.
However, if we can peer deeper and more clearly into the “mind” of AI, DeepMind and others are optimistic that mechanistic interpretability could provide a viable path to alignment: the process of making sure AI is actually doing what we want it to do.