AI Benchmark Controversy: xAI vs. OpenAI
Debates over AI benchmarks, and how AI companies report them, are increasingly spilling into public view. The latest round of accusations and counter-accusations has put xAI, Elon Musk’s AI company, in the spotlight.
The xAI-Grok 3 Controversy
This week, an OpenAI employee accused xAI of publishing misleading benchmark results for its latest model, Grok 3. In response, Igor Babushkin, one of xAI’s co-founders, insisted that the company was transparent about its findings.
The Role of AIME 2025
xAI published a graph showing Grok 3’s performance on AIME 2025, a set of challenging questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark, but it is still frequently used to probe a model’s mathematical ability.
The Omission of Cons@64
xAI’s chart compared Grok 3’s performance against OpenAI’s o3-mini-high model on AIME 2025. However, the OpenAI employee pointed out that xAI’s chart omitted o3-mini-high’s score at “cons@64” (short for consensus@64): the model gets 64 tries at each problem, and its most frequent answer is taken as the final one, a setup that tends to boost benchmark scores considerably.
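To make the difference between the two settings concrete, here is a minimal sketch of how pass@1 and cons@64 scoring work. It is not xAI’s or OpenAI’s actual evaluation harness; the `Problem` type, the `noisy_model` stand-in, and the 40% accuracy figure are purely illustrative assumptions.

```python
import random
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Problem:
    question: str
    answer: str

def pass_at_1(model, problems):
    """Grade each problem on the model's single (first) answer."""
    return sum(model(p) == p.answer for p in problems) / len(problems)

def cons_at_k(model, problems, k=64):
    """Sample k answers per problem and grade the majority-vote answer.
    With k=64 this is the 'cons@64' setting shown in the charts."""
    correct = 0
    for p in problems:
        samples = [model(p) for _ in range(k)]                 # k independent attempts
        consensus, _ = Counter(samples).most_common(1)[0]      # most frequent answer wins
        correct += consensus == p.answer
    return correct / len(problems)

# Toy model: answers correctly 40% of the time, otherwise guesses one of two wrong values.
def noisy_model(p):
    return p.answer if random.random() < 0.4 else random.choice(["wrong_a", "wrong_b"])

problems = [Problem(f"q{i}", "42") for i in range(30)]
print("pass@1 :", pass_at_1(noisy_model, problems))
print("cons@64:", cons_at_k(noisy_model, problems))
```

Even this toy setup shows the effect: a model that is right on only about 40% of first attempts can score near 100% under cons@64, because the correct answer only needs to be the most common one across 64 samples, not the first one.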
The True Comparison
Grok 3’s scores at “@1” (the model’s first attempt at each problem) fall below o3-mini-high’s. Grok 3 Reasoning Beta also trails slightly behind OpenAI’s o1 model set to medium computing. Nevertheless, xAI is marketing Grok 3 as the “world’s smartest AI.”
Defending the Claims
Babushkin countered that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing different iterations of its own models. A more neutral observer assembled a more comprehensive graph showing nearly every model’s performance at cons@64:
“Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it’s DeepSeek propaganda. (I actually believe Grok looks good there, and OpenAI’s TTC chicanery behind o3-mini-*high*-pass@‘1’ deserves more scrutiny.)” — Teortaxes (@teortaxesTex), February 20, 2025
The Hidden Metric: Cost
AI researcher Nathan Lambert highlighted a crucial but often overlooked metric: the computational (and monetary) cost each model incurred to reach its best score. His point underscores how little most benchmark reporting communicates about models’ real limitations and strengths.
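A rough back-of-the-envelope sketch shows why that cost matters: scoring a benchmark at cons@64 requires roughly 64 times the inference of a single-attempt run. The token count and per-token price below are placeholder assumptions, not published figures for Grok 3, o3-mini-high, or any other model.

```python
# Illustrative cost comparison between pass@1 and cons@64 evaluation runs.
TOKENS_PER_ATTEMPT = 4_000          # assumed average completion length per attempt
PRICE_PER_MILLION_TOKENS = 5.00     # assumed USD price, purely illustrative

def eval_cost(num_problems, attempts_per_problem):
    """Estimated dollar cost of an evaluation run under the assumptions above."""
    tokens = num_problems * attempts_per_problem * TOKENS_PER_ATTEMPT
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

# An AIME exam has 15 problems; the dollar figures here are only as good as the assumptions.
print("pass@1 :", eval_cost(15, 1))    # single attempt per problem
print("cons@64:", eval_cost(15, 64))   # 64 samples per problem, ~64x the spend
```

Whatever the exact prices, the ratio is the point: a cons@64 headline number is bought with dramatically more compute than a pass@1 number, and charts rarely say so.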
Conclusion: The Need for Transparency
The debate over AI benchmarks demands transparency and rigorous standards. Aggregation settings like cons@64 can inflate headline scores, while the real capability and cost-effectiveness of models remain hard to compare. Companies should strive for honesty in their reporting so that consumers and researchers get a complete picture of each model’s strengths and limitations.
Join the Discussion
We encourage you to share your thoughts on this controversy. Let us know how you think these debates should be handled and what metrics you believe are most important in evaluating AI models.
Feel free to comment below, subscribe to our newsletter to stay informed about the latest in AI developments, or share this article on your social media channels.