A New AI Coding Contest Has Produced Surprising Results

The first winner of a new AI coding contest has been announced, and the winning score sets a humbling new bar for AI-assisted software development.

The nonprofit Laude Institute announced the inaugural winner of the K Prize on Wednesday at 5 p.m. PST. The multi-round AI coding competition was launched by Andy Konwinski, co-founder of Databricks and Perplexity. The winner, a Brazilian prompt engineer named Eduardo Rocha de Andrade, will receive $50,000 for the victory. More surprising than the win itself was his final score: he answered just 7.5% of the test’s questions correctly.

Konwinski said he was pleased to have created a genuinely challenging benchmark: if benchmarks are to be meaningful, he argued, they have to be hard. He has also pledged $1 million to the first open-source model that scores 90% or above on the test.

Like the popular SWE-Bench benchmark, the K Prize tests models against flagged issues from GitHub as a gauge of how well they handle real-world programming problems. But where SWE-Bench is based on a fixed set of problems that models can train against, the K Prize is designed to be a “contamination-free version of SWE-Bench,” using a timed entry process to prevent any benchmark-specific training. For round one, models had to be submitted by March 12th; the K Prize organizers then built the test using only GitHub issues flagged after that date.
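To make the timed-entry idea concrete, here is a minimal sketch of how a post-deadline issue set could be collected. It is an illustration only, not the K Prize’s actual pipeline: the repository, the bug-label filter, and the deadline year are all assumptions.

```python
import requests

# Illustrative sketch of a timed-entry benchmark: gather GitHub issues
# created only AFTER a fixed submission deadline, so no model submitted
# before that date could have trained on them.
# NOTE: the deadline year, repo, and label filter are assumptions, not
# details confirmed by the K Prize organizers.
DEADLINE = "2025-03-12"  # the article says only "March 12th"

def fetch_post_deadline_issues(repo: str, per_page: int = 20) -> list[dict]:
    """Search a repository for bug issues filed after the deadline."""
    query = f"repo:{repo} is:issue label:bug created:>{DEADLINE}"
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": query, "per_page": per_page},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["items"]

if __name__ == "__main__":
    # Placeholder repository, chosen only for illustration.
    for issue in fetch_post_deadline_issues("psf/requests"):
        print(issue["number"], issue["title"])
```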

That 7.5% top score stands in stark contrast to SWE-Bench itself, which currently shows a top score of 75% on its easier ‘Verified’ test and 34% on its harder ‘Full’ test. Konwinski still isn’t sure whether the gap reflects contamination on SWE-Bench or simply the difficulty of collecting fresh issues from GitHub, but he expects the K Prize project to answer that question soon.

“As we get more runs of the thing, we’ll have a better sense,” he said, “because we expect people to adapt to the dynamics of competing on this every few months.”

Given the wide range of AI coding tools already available to the public, it may seem like an odd place for models to fall short. But with existing benchmarks becoming too easy, many critics see efforts like the K Prize as a necessary step toward solving AI’s growing evaluation problem.

Princeton researcher Sayash Kapoor, who recently proposed a similar approach, is supportive: “I’m quite bullish about building new tests for existing benchmarks. Without such experiments, we can’t really tell if the problem is contamination, or even just aiming at the SWE-Bench leaderboard with a human in the loop.”

Konwinski sees the K Prize as both a better benchmark and an open challenge to the rest of the industry. “If you follow the hype, it’s like we should be seeing AI software engineers, AI doctors, and AI lawyers, and that’s just not true,” he said. “If we can’t even achieve more than 10% on a contamination-free SWE-Bench, that’s a reality check for me.”
