AI is getting better at passing tests designed to gauge human creativity. In a study that used the Alternate Uses Task to evaluate this ability, AI chatbots scored higher on average than human participants.
The finding is likely to rekindle a long-running debate among AI researchers about what it even means for a machine to pass tests designed for humans. The results do not necessarily mean that AIs are developing the capacity to do things that are uniquely human; they may simply be able to pass creativity tests without being creative in the sense we usually give the word. Still, studies like this one may help us understand how humans and machines approach creative tasks.
The researchers first challenged three AI chatbots—OpenAI’s ChatGPT and GPT-4, and Copy.Ai, which is built on GPT-3—to come up with as many uses as possible for a rope, a box, a pencil, and a candle in just 30 seconds.
The large language models were instructed to come up with original and creative uses for each of the objects and were told that the quality of the ideas mattered more than the quantity. Each chatbot was tested 11 times for each of the four objects. The researchers gave the same instructions to 256 human participants.
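For readers curious how such a protocol might look in practice, here is a minimal, illustrative sketch using the OpenAI Python client. The prompt wording, model name, and loop structure are assumptions based on the description above, not the researchers’ actual setup, and Copy.Ai would require a different client entirely.

```python
# Illustrative sketch only: prompt wording, model choice, and trial count
# are assumptions based on the article's description, not the study's protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

OBJECTS = ["rope", "box", "pencil", "candle"]
TRIALS_PER_OBJECT = 11  # the article reports 11 tests per object per chatbot

PROMPT = (
    "Come up with original and creative uses for a {obj}. "
    "The quality of the ideas matters more than the quantity."
)

def collect_responses(model: str = "gpt-4") -> dict[str, list[str]]:
    """Query the model repeatedly and collect its proposed uses per object."""
    responses: dict[str, list[str]] = {obj: [] for obj in OBJECTS}
    for obj in OBJECTS:
        for _ in range(TRIALS_PER_OBJECT):
            reply = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": PROMPT.format(obj=obj)}],
            )
            responses[obj].append(reply.choices[0].message.content)
    return responses
```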
To compare the AI and human responses, the researchers used two measures. The first was an algorithm that scored how closely a proposed use of an object related to the object’s original purpose. The second involved asking six human assessors—who did not know that some of the responses had been generated by AI systems—to rate each response on a scale of 1 to 5 for creativity and originality, with 1 meaning not at all creative or original and 5 meaning very much so. Average scores were then calculated for the humans and for each of the AIs.
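As a rough illustration of these two scoring approaches, the sketch below uses an off-the-shelf sentence-embedding model as a stand-in for the automated semantic-distance score (the study’s actual algorithm is not described here) and a simple mean to aggregate the 1-to-5 ratings from the human assessors.

```python
# Rough illustration of the two scoring approaches described above.
# The embedding model is a stand-in; the study's actual semantic-distance
# algorithm is not specified in the article.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_distance(obj: str, proposed_use: str) -> float:
    """Score how far a proposed use strays from the object's ordinary purpose.

    Higher values mean the use is semantically further from the object itself,
    treated here as a proxy for originality.
    """
    obj_vec, use_vec = embedder.encode([obj, proposed_use])
    cosine_sim = float(
        np.dot(obj_vec, use_vec)
        / (np.linalg.norm(obj_vec) * np.linalg.norm(use_vec))
    )
    return 1.0 - cosine_sim  # distance = 1 - similarity

def mean_human_rating(ratings: list[int]) -> float:
    """Average the 1-5 creativity/originality ratings from the six assessors."""
    return float(np.mean(ratings))

# Example usage with made-up data:
print(semantic_distance("candle", "use the wax to seal a letter"))
print(mean_human_rating([4, 5, 3, 4, 4, 5]))
```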
Although the chatbots’ responses were rated better than the human responses on average, the highest-scoring human answers were better still.
The study, co-led by Simone Grassini, an associate professor of psychology at the University of Bergen in Norway, was not intended to show that AI systems can replace people in creative roles, but it raises philosophical questions about which qualities are unique to humans.
When it comes to mimicking human behavior, he says, technology has taken a big leap forward over the past few years, and these models are continuously evolving.
Ryan Burnell, a senior research associate at the Alan Turing Institute who was not involved in the research, says that showing machines can perform well on tasks designed to measure creativity in humans does not prove they are capable of anything resembling original thought.
Because the chatbots that were tested are “black boxes,” he says, we don’t know exactly what data they were trained on or how they generate their responses. What is very likely happening here, he notes, is that rather than coming up with new creative ideas, a model is simply drawing on what it has seen in its training data, which may well include this exact Alternate Uses Task. In that case, we are not measuring creativity; we are measuring the model’s prior exposure to this kind of task.
Even so, says Anna Ivanova, an MIT postdoctoral researcher who studies language models and was not involved in the project, it is still useful to compare how humans and machines tackle particular problems.
Although chatbots are very good at fulfilling specific requests, she cautions that small tweaks, such as rephrasing a prompt, can be enough to stop them from performing as well. Ivanova thinks studies of this kind should lead us to examine the relationship between the task we ask AI models to perform and the cognitive ability we are trying to measure. We should not assume, she says, that people and models solve problems in the same way.