According to a recent study, an AI agent cannot yet completely replace a human consultant.
Mercor, the massive AI training company, evaluated how well top AI models, acting as agents, handled actual banking, legal, and consulting tasks. Brendan Foody, Mercor's CEO, said that while the models often failed, the data tells only part of the story.
The consulting tasks in Mercor's APEX-Agents benchmark were created with input from consultants at McKinsey, BCG, Deloitte, Accenture, and EY, as well as expert surveys, to replicate actual management consulting work.
The AI agents completed tasks on their first attempt less than 25% of the time across all task categories, and even given eight tries, they finished only 40% of the jobs. OpenAI's GPT 5.2 initially outperformed the others, completing just over 23% of the management consulting assignments on its first try. Anthropic's Opus 4.6, released this week, scored even higher at about 33%.
While many tasks went unfinished, Foody pointed to how quickly the models are improving: GPT 3's success rate was only 3%, compared to 23% for GPT 5.2, and Anthropic's model jumped from 13% to 33% on consulting tasks in just a few months. Foody anticipates the models will reach a success rate of around 50% by year-end. "These are some of the most difficult tasks in the economy that individuals pay millions of dollars to consulting firms to complete, and the models are finally able to do it at an astounding pace of progress," Foody said.
AI has already upended the consulting sector by altering how firms recruit and generate revenue, and as the models get better, the possibility of agents replacing consultants grows.
Bob Sternfels, the head of McKinsey, reportedly stated that 25,000 of the esteemed management consulting firm's 60,000 workers were AI agents.
According to Sternfels, McKinsey is expanding without adding more employees for the first time in its history.
Where AI agents fail in consulting jobs
Mercor tested a number of frontier models, including those from Google, Anthropic, and OpenAI.
In one example consulting task, the AI agent was asked to "analyze category consumption patterns and market penetration using the Category Penetration Score methodology for PureLife's portfolio strategy" and was expected to provide a number of specific outputs. The agents were unable to produce an accurate response. According to the findings, "no model is prepared to replace a professional end-to-end."
According to Foody, Mercor discovered that the AI agents performed well in research and data analysis.
Where they consistently failed was on longer-horizon tasks: the longer a task would take a human to finish, or the more steps it required, the more likely the model was to struggle.
Unlike humans, Foody explained, the models struggle to work out where in a given file system to look for the relevant information, so they frequently end up examining the wrong files. They also struggle with the planning involved in using several tools and cross-referencing files at the same time.
The models work reasonably well on jobs that can be completed in an hour or less, or that require only one tool.
Foody compared the agents to interns: their work might pass half the time, but a partner will still find plenty of problems with it.
Frank Jones, a former KPMG consultant who is now an expert contractor for Mercor, said he has trained AI models and found that while they can approximate certain jobs, human refinement is frequently required.
He added that the models require highly specific prompts because they may not always understand typical consulting terms or expectations, such as "client-ready." "The majority of consultants are aware of that. However, I believe there is a lot of complexity in that regard for AI," he stated.
The AI models are rapidly getting better
Foody asserts that what is needed to keep improving the models is more and better training rather than a breakthrough, and that the frontier labs are already investing heavily in it. "That's why we make so much revenue," he remarked, adding, "We're in the business of replacing human judgment."
In the fall, Mercor, whose clients include OpenAI, Anthropic, and Meta, secured a $10 billion funding agreement. The company employs over 30,000 contractors around the world who help train AI models by rewriting chatbot responses. Foody previously said its sales would grow 4,658% in 2025.
Foody believes that consulting jobs, particularly lower-level roles, will be among those eliminated by AI. He said the next iteration of the AI agents benchmark will expand to examine the entire value chain of a professional services firm: "Instead of evaluating the analyst, we're evaluating McKinsey itself."
He claims McKinsey currently finds Mercor's AI agent benchmark an appealing story, since it demonstrates that AI can be used to enhance human capabilities rather than replace them. "The next version of APEX tells a very scary story for McKinsey," he said, adding, "In the next two years, we'll have chatbots as good as the best consulting firm."