Even the finest AI models hallucinate

Every generative AI model, from Google’s Gemini and Anthropic’s Claude to the latest stealth release of OpenAI’s GPT-4o, hallucinates. In other words, the models are unreliable narrators, sometimes to comic effect and sometimes problematically so.

But not every model makes things up at the same rate. And the falsehoods they spread depend on which sources of information they have been exposed to.

To benchmark hallucinations, researchers from Cornell, the universities of Washington and Waterloo, and the nonprofit research institute AI2 recently fact-checked models such as GPT-4o against reliable sources on subjects ranging from geography and law to history and health. They found that no model performed exceptionally well across all topics, and that the models that hallucinated the least did so in part because they refused to answer questions they would otherwise have gotten wrong.
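The article doesn’t detail the study’s scoring protocol, but the bookkeeping it implies, sorting each response into correct, hallucinated, or refused, can be sketched roughly as follows. The labels, data structures, and matching logic are illustrative assumptions, not the researchers’ actual evaluation code, which relies on fact-checking against sources rather than string comparison.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    CORRECT = "correct"            # answer agrees with the reference source
    HALLUCINATED = "hallucinated"  # answer contradicts the reference source
    ABSTAINED = "abstained"        # model declined to answer


@dataclass
class Item:
    question: str
    reference_answer: str          # fact-checked against a reliable source


def score_response(item: Item, model_answer: str) -> Verdict:
    """Classify one model response. Real evaluations use human or
    model-assisted fact-checking; exact matching is only a stand-in here."""
    answer = model_answer.strip().lower()
    if not answer or "i don't know" in answer or "cannot answer" in answer:
        return Verdict.ABSTAINED
    if answer == item.reference_answer.strip().lower():
        return Verdict.CORRECT
    return Verdict.HALLUCINATED


if __name__ == "__main__":
    item = Item(question="What is the capital of Australia?",
                reference_answer="Canberra")
    print(score_response(item, "Canberra"))        # Verdict.CORRECT
    print(score_response(item, "I don't know."))   # Verdict.ABSTAINED
    print(score_response(item, "Sydney"))          # Verdict.HALLUCINATED
```

Keeping refusals separate from wrong answers is what lets the researchers say a model hallucinates less partly because it declines to answer.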

Wenting Zhao, a Cornell doctoral student and co-author of the study, said the most important takeaway from the work is that we still can’t fully trust the output of model generations. Even the best models can generate hallucination-free text only about 35% of the time, she said.

Other academic work has probed the “factuality” of models, including a study by a team unrelated to AI2. But Zhao notes that these earlier tests asked the models questions whose answers are easy to find on Wikipedia, and since most models are trained on Wikipedia data, those weren’t exactly the hardest questions to ask.

To make their benchmark more challenging, and to better reflect the kinds of questions people actually ask of models, the researchers sought out topics around the web that don’t have a Wikipedia entry. Roughly half of the test’s questions can’t be answered with the help of Wikipedia (though some Wikipedia-answerable ones are included for good measure). Topics covered by the test include pop culture, astronomy, finance, medicine, computer science, geography, and celebrities.

The researchers evaluated more than a dozen popular models for their study, many of them released within the past year. In addition to GPT-4o, they tested “open” models such as Meta’s Llama 3 70B, Mistral’s Mixtral 8x22B, and Cohere’s Command R+, as well as gated-behind-API models such as Google’s Gemini 1.5 Pro, Anthropic’s Claude 3 Opus, and Perplexity’s Sonar Large, which is built on Llama.

The findings suggest that models aren’t hallucinating much less these days, despite claims to the contrary from OpenAI, Anthropic, and the other big players in generative AI.

On the benchmark, GPT-4o and OpenAI’s much older flagship GPT-3.5 performed about the same in terms of the percentage of questions they answered correctly (GPT-4o was marginally better). OpenAI’s models, along with Command R, Mixtral 8x22B, and Perplexity’s Sonar models, were the least hallucinatory overall.

The models struggled most with questions about celebrities and finance, but had the easiest time with questions about geography and computer science (perhaps because those subjects are better represented in their training data). Every model, but especially GPT-3.5 and GPT-4o, answered less factually on average when the answer’s source wasn’t Wikipedia, suggesting that all of them lean heavily on Wikipedia content.

Even models that can search the web, such as Command R and Perplexity’s Sonar models, struggled with the benchmark’s “non-Wiki” questions. Model size didn’t matter much, either: smaller models (like Anthropic’s Claude 3 Haiku) hallucinated roughly as often as larger, ostensibly more capable ones (like Claude 3 Opus).

What does all of this mean, and where are the promised improvements from the vendors?

It wouldn’t surprise us if vendors exaggerated their claims. But a more charitable read is that the benchmarks they’re using aren’t fit for this purpose. As we’ve discussed previously, many, if not most, AI evaluations are fleeting and lacking in crucial context, which makes them vulnerable to Goodhart’s law.

Still, Zhao says she expects the hallucination problem to “persist for a long time.”

Empirical results in the paper show that, despite the promise of certain methods to reduce or eliminate hallucinations, the actual improvement achievable with those methods is limited, she added. The analysis also found that even information found on the internet can often be conflicting, in part because the training data, which is authored by humans, can contain hallucinations of its own.

An interim approach could be simply to train models to refuse to answer more often, the technical equivalent of telling a know-it-all to knock it off.

In the researchers’ tests, Claude 3 Haiku answered only about 72% of the questions, abstaining from the rest. Accounting for those abstentions, Claude 3 Haiku was the most factual model of them all, at least in the sense that it lied the least.
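To see why a model that skips questions can still come out as the “most factual,” it helps to separate response rate from accuracy on the questions actually answered. The numbers below are hypothetical; only the roughly 72% response rate echoes the figure above.

```python
def rates(answered: int, correct: int, total: int) -> dict:
    """Summarize a model's behaviour on a benchmark of `total` questions."""
    hallucinated = answered - correct
    return {
        "response_rate": answered / total,
        "accuracy_on_answered": correct / answered,
        "hallucination_rate_overall": hallucinated / total,
    }


# A cautious model: answers 72% of 500 questions and gets most of those right.
cautious = rates(answered=360, correct=300, total=500)

# A talkative model: answers everything, but is wrong far more often.
talkative = rates(answered=500, correct=330, total=500)

for name, r in [("cautious", cautious), ("talkative", talkative)]:
    print(name, {k: round(v, 2) for k, v in r.items()})

# The cautious model produces fewer false statements overall (it "lies the
# least"), even though the talkative model answers more questions correctly
# in absolute terms.
```

Under these made-up numbers the cautious model makes 60 false claims across the whole test while the talkative one makes 170, which is the sense in which abstaining improves factuality.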

But will users put up with a model that answers only some of their questions? Zhao thinks not, and argues that vendors should devote more time and resources to research on reducing hallucinations. Completely eliminating hallucinations may not be possible, she says, but they can be mitigated through human-in-the-loop fact-checking and citation during a model’s development.

Zhao emphasized the need for policies and regulations that ensure human experts are always involved in verifying and validating the information generated by generative AI models. There is still plenty of opportunity to make meaningful contributions in this area, she said, such as developing advanced fact-checking tools for any free-form text, providing citations for factual content, and offering corrections for hallucinated text.
