Artificial intelligence (AI) has advanced dramatically in recent years at complex tasks once thought to be the exclusive purview of human intelligence. Systems like GPT-4 and PaLM 2 have matched or outperformed humans on a variety of benchmarks, passing the bar exam, acing the SAT, mastering language tasks, and correctly identifying medical images.
Benchmarks are essentially standardized tests that measure how well AI systems perform on predetermined tasks. Researchers and developers rely on them to compare and improve models and algorithms, but a recent paper published in Science questions the relevance and validity of many current benchmarks for assessing AI systems.
The study argues that benchmarks frequently miss the true capabilities and limitations of AI systems, which can lead to inaccurate or misleading conclusions about their reliability and safety. For instance, benchmarks might not account for how AI systems handle ambiguity, adversarial inputs, or uncertainty. They might also fail to capture how AI systems interact with people or other systems in complex and dynamic environments.
This poses a significant challenge when deciding where these systems are safe to deploy. And with mounting pressure on companies to build powerful AI systems into their products, the community needs to rethink how it evaluates new models.
The problem with aggregate metrics
To build AI systems that are safe and fair, researchers and developers must understand what a system is capable of and where it fails.
Ryan Burnell, an AI researcher at the University of Cambridge and lead author of the paper, said that to build that understanding, “we need a research culture that is serious about both robustness and transparency.” He added, “However, we believe that the research culture is currently lacking on both fronts.”
One of the main issues Burnell and his co-authors highlight is the use of aggregate metrics that summarize an AI system’s overall performance on a category of tasks, such as math, reasoning, or image classification. Aggregate metrics are convenient because they are simple. But that convenience comes at the cost of transparency and detail about the nuances of a system’s performance on important tasks.
When you have data from dozens of tasks, and perhaps thousands of instances of each task, it isn’t easy to analyze and communicate. According to Burnell, aggregate metrics make it possible to report results in a clear, simple way that readers, reviewers, or, as we are now seeing, consumers can grasp immediately. The trouble with this simplification is that it can obscure important patterns in the data, patterns that might point to biases or safety issues, or simply tell us more about how the system works.
Aggregate benchmarks can fail in many different ways. For example, a model may do well overall on a benchmark but poorly on a subset of tasks. Research on commercial facial recognition systems found that models with very high overall accuracy struggled to recognize faces with darker skin tones. In other cases, a model may learn spurious patterns, such as identifying objects by their backgrounds, watermarks, or other artifacts unrelated to the main task. Things become considerably more complicated with large language models (LLMs).
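To make the problem concrete, here is a minimal sketch of how an aggregate score can hide a failing subgroup. The subgroup labels and numbers are invented for illustration and are not taken from any of the studies mentioned.

```python
from collections import defaultdict

# Hypothetical per-instance results: (subgroup, was the prediction correct?)
# The subgroup labels and counts are illustrative only.
results = (
    [("subgroup_a", True)] * 950 + [("subgroup_a", False)] * 50 +
    [("subgroup_b", True)] * 60 + [("subgroup_b", False)] * 40
)

# The aggregate accuracy looks impressive...
overall = sum(correct for _, correct in results) / len(results)
print(f"overall accuracy: {overall:.1%}")  # ~91.8%

# ...but disaggregating by subgroup reveals a serious failure mode.
per_group = defaultdict(list)
for group, correct in results:
    per_group[group].append(correct)

for group, outcomes in per_group.items():
    print(f"{group} accuracy: {sum(outcomes) / len(outcomes):.1%}")
# subgroup_a accuracy: 95.0%
# subgroup_b accuracy: 60.0%
```

A single headline number of roughly 92% accuracy says nothing about the 60% accuracy on the smaller subgroup, which is exactly the kind of pattern the authors warn gets lost in aggregation.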
This challenge is getting worse, Burnell says, because the range of capabilities that needs to be evaluated widens as large language models become more general-purpose. As a result, when all the data is lumped together, apples get mixed with oranges in a way that doesn’t make much sense.
Various studies have found that LLMs that excel at difficult tasks can fail miserably at much simpler ones. They may, for example, solve hard math problems yet give wrong answers when the same problem is phrased differently. Other studies have found that the same models cannot solve simple problems a person would need to master before learning the more difficult ones.
The bigger issue, according to Burnell, is that we risk becoming overconfident in the capabilities of our systems and deploying them in circumstances where they are neither safe nor reliable.
One of the well-publicized accomplishments in the GPT-4 technical report, for instance, is the model’s ability to pass a mock bar exam and place in the top 10% of test takers. The report, however, doesn’t go into detail about which questions or tasks the model struggled with.
If some of those tasks are crucial or come up regularly, Burnell says, we would not want to trust the system in a situation with such high stakes. That isn’t to say ChatGPT can’t be helpful in legal settings, but knowing that it scored in the 90th percentile on the bar exam is not enough to draw sound conclusions about the matter.
Granular data can enhance AI assessment
In their study, Burnell and his co-authors also flag the absence of instance-by-instance evaluation reporting as a problem. Without access to detailed information about the instances used to test a model, it is extremely difficult for independent researchers to verify or reproduce the findings reported in papers.
Evaluation transparency is crucial from an accountability standpoint. The community needs a way to independently examine evaluation data to assess how robust systems are and to look for biases or failure points, according to Burnell. And from a scientific standpoint, making evaluation results publicly available is genuinely valuable.
However, instance-by-instance evaluation data is becoming harder and harder to obtain. According to one survey, only a small fraction of papers published at top AI conferences provide granular access to test instances and results. And the cost of inference and the sheer number of test instances required make evaluating cutting-edge systems like ChatGPT and GPT-4 expensive and time-consuming.
Without this information, other researchers and decision-makers must either spend significant money running their own experiments or accept the published results at face value. Much of that wasted expense could be avoided if the original researchers made their evaluation data public. And with the growing number of platforms for uploading evaluation results, publishing research data has become simpler and far less expensive.
According to Burnell, evaluation data can be put to many uses that the researchers running the initial evaluation might not have considered, especially for the standardized benchmarks that are so prominent in AI. If the data are made available, other researchers can readily conduct follow-up analyses without spending the time and resources to repeat the evaluation.
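As a rough illustration of what instance-by-instance reporting could look like, the sketch below writes one record per test instance so that others can re-analyze or audit a run. The schema and field names are assumptions made for this example, not a format prescribed by the paper.

```python
import csv
from dataclasses import dataclass, asdict

# Hypothetical schema for per-instance evaluation records; the field names
# are assumptions for this sketch, not a format defined by the paper.
@dataclass
class InstanceResult:
    benchmark: str      # e.g. "mock_bar_exam"
    instance_id: str    # unique identifier of the test item
    category: str       # a feature of the problem space (topic, difficulty, ...)
    prompt: str         # the input given to the model
    model_output: str   # the raw response produced by the model
    correct: bool       # the graded outcome for this instance

def save_results(results: list[InstanceResult], path: str) -> None:
    """Write one row per test instance so the run can be re-analyzed by others."""
    rows = [asdict(r) for r in results]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

# Example usage with two made-up bar exam items.
save_results(
    [
        InstanceResult("mock_bar_exam", "q_001", "contracts", "...", "...", True),
        InstanceResult("mock_bar_exam", "q_002", "evidence", "...", "...", False),
    ],
    "evaluation_instances.csv",
)
```

A flat file like this is cheap to publish alongside a paper, and it lets independent researchers slice the results however they need without rerunning the evaluation.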
Where is the field headed?
Burnell and his co-authors offer several recommendations to help the community better understand and evaluate AI systems. One best practice is publishing detailed performance reports with breakdowns across features of the problem space. The community should also develop new benchmarks that test specific capabilities rather than collapsing multiple skills into a single measure. And researchers should be more transparent about documenting their experiments and making them publicly available.
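Building on the hypothetical per-instance records sketched earlier, producing that kind of disaggregated report is then a short grouping step. The example below uses a few made-up rows and pandas, which is an implementation choice for this sketch, not something the paper specifies.

```python
import pandas as pd

# A few made-up per-instance records, mirroring the hypothetical schema above.
df = pd.DataFrame([
    {"benchmark": "mock_bar_exam", "category": "contracts", "correct": True},
    {"benchmark": "mock_bar_exam", "category": "contracts", "correct": True},
    {"benchmark": "mock_bar_exam", "category": "evidence",  "correct": False},
    {"benchmark": "mock_bar_exam", "category": "evidence",  "correct": True},
])

# A single aggregate number per benchmark...
print(df.groupby("benchmark")["correct"].mean())

# ...versus a breakdown across features of the problem space,
# which is the kind of detailed report the authors recommend.
report = (
    df.groupby(["benchmark", "category"])["correct"]
      .agg(accuracy="mean", n_instances="size")
      .sort_values("accuracy")
)
print(report)
```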
According to Burnell, the academic community is generally heading in the right direction. Conferences and journals, for instance, are beginning to encourage or require authors to upload their code and data along with their submissions.
Burnell said that some companies, including Hugging Face and Meta, are working hard to follow the best practices recommended by the wider community, such as open-sourcing data and models and sharing model cards that detail how a model was trained.
At the same time, though, the commercial AI market is moving away from transparency and sharing.
Burnell cautions against this shift, because it encourages companies to ignore the flaws and failures in their models and to cherry-pick evaluation data that makes their models look highly reliable and powerful.
Given how popular these models are becoming and the extremely broad range of applications they can be used for, Burnell believes we are in a potentially very dangerous situation, with limited ability to fully understand the capabilities and limitations of these systems. He argues that we must work hard to ensure independent groups have the access needed to properly evaluate these systems, and that legislative solutions are likely to be a significant part of the answer.