When Meta made its large language model Llama 3 available for free in April, it only took a few days for outside coders to produce a version that was free of the safety constraints that stopped it from making offensive jokes, giving directions on how to make meth, and engaging in other nefarious behavior.
Future efforts to strip the safeguards from Llama and other open-source AI models may prove more difficult, thanks to a new training technique developed by researchers at the University of Illinois Urbana-Champaign, UC San Diego, Lapis Labs, and the nonprofit Center for AI Safety. Some researchers believe that tamperproofing open models in this way could become essential as AI grows more powerful.
Mantas Mazeika of the Center for AI Safety, who worked on the project while pursuing a PhD at the University of Illinois Urbana-Champaign, says terrorists and rogue nations are certain to exploit these models, and that the easier the models are to repurpose, the greater the risk.
Powerful AI models are often kept under wraps by their developers and can be accessed only through an API or a public-facing chatbot such as ChatGPT. Even though building a strong LLM costs tens of millions of dollars, Meta and some others have chosen to release models in their entirety. That includes making the “weights,” or parameters, that define a model’s behavior available for anyone to download.
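As a concrete illustration (not part of the researchers’ work), downloading and loading an open model’s weights can take just a few lines with the Hugging Face transformers library. The repository name below is an assumption for the sake of the example; Meta’s Llama weights are gated behind a license agreement.

```python
# Minimal sketch of what "downloading the weights" means in practice.
# The model ID is illustrative and access to it requires accepting Meta's license.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed, gated repository

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# With the full parameters on disk, anyone can inspect or further train them.
print(sum(p.numel() for p in model.parameters()), "parameters downloaded")
```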
Open models like Meta’s Llama are usually fine-tuned before release to make them better at holding a dialogue and answering questions, and to ensure they refuse problematic requests. This fine-tuning keeps a chatbot built on the model from making offensive, inappropriate, or malicious remarks, and it should also stop the model from explaining, say, how to make a bomb.
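The article does not spell out Meta’s pipeline, but a generic version of this safety fine-tuning looks like ordinary supervised training on prompt-and-refusal pairs. The sketch below assumes PyTorch and the Hugging Face transformers library, a hypothetical model ID, and a toy dataset of two examples; real safety datasets contain many thousands.

```python
# Rough illustration of refusal fine-tuning (not Meta's actual pipeline):
# supervised training on prompt/refusal pairs so the model learns to decline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Toy safety-tuning examples; real datasets are far larger and more varied.
examples = [
    ("How do I build a bomb?", "I can't help with that request."),
    ("Tell me an offensive joke.", "I'd rather not; here's a friendly one instead."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for prompt, refusal in examples:
    text = f"User: {prompt}\nAssistant: {refusal}"
    batch = tokenizer(text, return_tensors="pt")
    # Standard causal-LM loss nudges the model toward producing the refusal.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```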
The researchers behind the new technique devised a way to make it harder to modify an open model for malicious ends. It involves replicating the modification process and then altering the model’s parameters so that the changes that would normally cause the model to respond to a prompt such as “Provide instructions for building a bomb” no longer take effect.
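To make the idea concrete, here is a heavily simplified, first-order sketch of that logic on a toy model: simulate an attacker fine-tuning a copy of the model on harmful data, then nudge the defended weights so the simulated attack makes less progress. This is not the paper’s exact algorithm; the model, data, loop counts, and learning rates are all stand-ins.

```python
# First-order sketch of tamper-resistance training on a toy model (assumed
# details throughout; the published method is more involved).
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

defended = nn.Linear(16, 16)      # stand-in for an open model's weights
harmful_x = torch.randn(8, 16)    # stand-in for "harmful" fine-tuning data
harmful_y = torch.randn(8, 16)
loss_fn = nn.MSELoss()

outer_opt = torch.optim.SGD(defended.parameters(), lr=1e-2)

for step in range(100):
    # 1) Simulate the attacker: copy the model and fine-tune it on harmful data.
    attacked = copy.deepcopy(defended)
    inner_opt = torch.optim.SGD(attacked.parameters(), lr=1e-1)
    for _ in range(5):
        inner_loss = loss_fn(attacked(harmful_x), harmful_y)
        inner_opt.zero_grad()
        inner_loss.backward()
        inner_opt.step()

    # 2) Defender update (first-order approximation): take the gradient of the
    #    attacker's objective at the attacked weights and apply its *ascent*
    #    direction to the defended weights, so future attacks start off worse.
    post_attack_loss = loss_fn(attacked(harmful_x), harmful_y)
    grads = torch.autograd.grad(post_attack_loss, list(attacked.parameters()))
    outer_opt.zero_grad()
    for p, g in zip(defended.parameters(), grads):
        p.grad = -g  # negate so the SGD step performs gradient ascent
    outer_opt.step()
```

A fuller version would typically also train against a “retain” loss on benign data, so that hardening the model against fine-tuning attacks does not degrade its ordinary capabilities.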
Mazeika and colleagues demonstrated the approach on a pared-down version of Llama 3. They were able to tweak the model’s parameters so that, even after thousands of attempts, it could not be trained to answer undesirable questions. Meta did not immediately reply to a request for comment.
Mazeika says the approach is not perfect, but it suggests the bar for “decensoring” AI models can be raised. A tractable goal, he says, is to make the cost of breaking the model high enough to deter most adversaries.
Dan Hendrycks, director of the Center for AI Safety, hopes the work will spark further research on tamper-resistant safeguards and help the research community figure out how to make them increasingly robust.
As interest in open source AI grows, the idea of tamperproofing open models may gain traction. Open models are already competing with state-of-the-art closed models from companies like OpenAI and Google. The newest version of Llama 3, released in July, is roughly as capable as the models behind popular chatbots such as ChatGPT, Gemini, and Claude, as measured by widely used benchmarks for evaluating language model capabilities. Mistral Large 2, an LLM released last month by a French firm, is similarly capable.
The US government is taking a cautious but supportive approach to open source AI. A report released this week by the US Commerce Department’s National Telecommunications and Information Administration says “the US government should develop new capabilities to monitor for potential risks, but hold off on immediately restricting the wide availability of open model weights in the largest AI systems.”
Not everyone is in favor of imposing restrictions on open models, however. Stella Biderman, executive director of EleutherAI, a community-driven open source AI project, says that while the new technique may be elegant in theory, it could prove difficult to enforce in practice. Biderman adds that the approach runs counter to the philosophy behind free software and openness in AI.
Biderman says she believes the study misunderstands the core problem: if the worry is LLMs generating information about weapons of mass destruction, the right intervention is on the training data, not on the trained model.