The Public Ranked 27 AI Models, and ChatGPT Came in at Number Eight

Although the field of artificial intelligence can often feel like the Wild West, a surprising amount of testing, benchmarking, and analysis takes place behind the scenes. Not just by the AI firms themselves, but also by independent organizations that develop their own rankings.

These teams assess everything from a chatbot’s emotional intelligence to its capacity to solve mathematical problems, produce visuals, demonstrate reasoning, provide medical advice, and more.

Models fluctuate across these tests, revealing their strengths and weaknesses in different domains. GPT-5, for instance, excels at scientific reasoning but lags behind Gemini and Claude when adapting to novel ideas.

Each of these tests offers new insight into AI models and serves as a useful guide to the most appropriate tool for a given situation. One measure, however, is frequently absent. Put simply: which AI models provide the best user experience?

Humaine is an AI leaderboard created by Prolific, a UK-based software company. Rather than assessing how well the models complete tasks, Prolific investigated how different users experienced them.

By evaluating 21,352 individuals' experiences with the tools, the researchers were able to identify an overall winner and to break down the findings by age, ethnicity, and political views across two locations (the US and the UK).

This comprises separate listings for:

  • UK: age groups
  • UK: ethnicity
  • UK: political view
  • US: age groups
  • US: ethnicity
  • US: political view

Each participant was shown two different AI models and asked to judge which one performed better in each encounter.
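Humaine doesn't publish the exact formula it uses to turn these head-to-head judgments into a leaderboard, but a common approach for aggregating pairwise votes is an Elo-style rating. Below is a minimal Python sketch of that idea; the model names, vote data, and `record_vote` helper are hypothetical illustrations, not Humaine's actual method.

```python
from collections import defaultdict

# Hypothetical sketch of Elo-style aggregation of pairwise votes.
# The article does not specify Humaine's actual scoring method.

K = 32          # step size for each rating update (standard Elo K-factor)
BASE = 1000.0   # every model starts from the same rating

ratings = defaultdict(lambda: BASE)

def record_vote(winner: str, loser: str) -> None:
    """Update both models' ratings after one head-to-head judgment."""
    # Probability the winner was expected to win, given current ratings.
    expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    delta = K * (1.0 - expected)  # upsets move ratings more than expected wins
    ratings[winner] += delta
    ratings[loser] -= delta

# Illustrative votes only, not real Humaine data.
votes = [
    ("gemini-2.5-pro", "gpt-4.1"),
    ("gemini-2.5-pro", "claude-4"),
    ("gpt-4.1", "claude-4"),
]
for winner, loser in votes:
    record_vote(winner, loser)

for rank, (model, score) in enumerate(
        sorted(ratings.items(), key=lambda kv: kv[1], reverse=True), start=1):
    print(f"{rank}. {model}: {score:.1f}")
```

The useful property of a scheme like this is that beating a highly rated model moves you up more than beating a low-rated one, so a full ranking emerges even though no participant ever compared more than two models at once.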

The result was an overall champion and scoreboard, along with distinct rankings for reasoning and performance on fundamental tasks, plus winners for communication, fluidity, trust, and ethics.

What are the findings?

Once the survey was complete, a clear victor emerged in the overall performance category and in the majority of the subcategories. Across practically every filter in the test, Gemini 2.5 Pro performed best.

US voters over 55, Democrat voters, and UK voters aged 18 to 34 all agreed that Gemini 2.5 Pro was the best model overall. Given the safety and ethical problems Grok has run into recently, it is rather amusing that Grok-3 was the only model any demographic group placed above Gemini in any category.

Interestingly, DeepSeek, Magistral Le Chat, and Grok are the three models that follow Gemini. DeepSeek was hugely popular early this year but has faded since, while the less well-known chatbot Le Chat retains a devoted following.

So where does the world-renowned ChatGPT land in all of this? It's a long scroll down: OpenAI's best-placed model, GPT-4.1, comes in at number eight. Claude fares even worse, with its two version 4 models ranking 11th and 12th overall.

So what does all of this mean?

Does this mean Gemini is the world's greatest AI chatbot? Should you stop using ChatGPT? Well, not exactly.

These findings don't necessarily reflect raw model performance. On most other criteria, ChatGPT, Gemini, Claude, and Grok are typically the options that rank highest.

Still, this leaderboard is a valuable addition to those tests: it helps us understand AI from a more human, experience-centred standpoint. Le Chat, despite its lower benchmark scores, is frequently ranked as a top choice for trust and experience.

And while Anthropic and OpenAI fared poorly in this round of testing, Gemini and Grok each notched another impressive result. Both routinely score well on conventional benchmarks and have done so here as well.
