OpenAI’s o3 model, initially introduced with strong performance claims, is facing renewed scrutiny after third-party evaluations revealed significantly lower benchmark results than previously reported.
Highlights
Independent assessments, particularly on the FrontierMath dataset—a benchmark for evaluating advanced reasoning in AI systems—indicate that the model’s performance may fall short of OpenAI’s initial statements.
When OpenAI first announced the o3 model in December, it emphasized that the system had outperformed all known competitors on FrontierMath.
According to Chief Research Officer Mark Chen, the model scored above 25% on the test set using aggressive compute configurations during internal evaluations. At the time, competing models had reportedly achieved scores below 2%.
Recent findings from research organization Epoch AI suggest the publicly released version of o3 performs closer to 10% on an updated FrontierMath dataset—less than half the score originally highlighted by OpenAI.
Variations in Model Version and Testing Conditions
Epoch AI tested the April release of the o3 model using a more recent version of FrontierMath, which consists of 290 high-difficulty math questions.
This differs from the earlier version of the benchmark used by OpenAI, which contained 180 questions.
While Epoch acknowledges that differences in testing conditions may account for some variation, the overall drop in accuracy has raised questions about consistency between internal benchmarks and publicly accessible models.
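For scale, those percentages translate into roughly 45 solved problems out of 180 under OpenAI's internal figure, versus roughly 29 out of 290 under Epoch's result. The short sketch below simply restates that arithmetic; the counts are approximations, since exact solved totals were not published.

```python
# Illustrative arithmetic only: converts the reported FrontierMath scores
# into approximate solved-problem counts. Exact totals were not published.
reported_runs = {
    "OpenAI internal run (December, 180-question set)": (180, 0.25),
    "Epoch AI run (April, 290-question set)": (290, 0.10),
}

for run, (num_questions, score) in reported_runs.items():
    solved = num_questions * score
    print(f"{run}: ~{solved:.0f} of {num_questions} problems ({score:.0%})")
```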
Further clarity emerged from the ARC Prize Foundation, which had early access to a pre-release version of o3.
ARC confirmed that the publicly released o3 variants, across all compute tiers, are smaller and less capable than the internal model it had previously tested. In the AI field, larger models and higher compute budgets typically yield better benchmark scores, which could explain some of the discrepancy.
Wenda Zhou, a member of OpenAI’s technical staff, addressed the issue during a livestream, stating that the public version of o3 was optimized for practical use, with a focus on lower latency and improved usability.
These trade-offs, intended to enhance real-world deployment, can impact benchmark performance. “You won’t have to wait as long when you’re asking for an answer,” Zhou noted, pointing to inference speed as a key consideration.
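The trade-off Zhou describes reflects a broader pattern: extra test-time compute, for example sampling a model several times and aggregating its answers, tends to raise accuracy at the cost of latency. The sketch below illustrates that general idea; it is not OpenAI's method, and the function names are hypothetical.

```python
# General illustration of a test-time compute trade-off, not OpenAI's method.
# Sampling more candidate answers and taking a majority vote costs latency
# but often improves accuracy on hard benchmarks.
import time
from collections import Counter

def answer_with_budget(sample_fn, prompt, num_samples=1):
    """sample_fn is a hypothetical function returning one candidate answer."""
    start = time.time()
    candidates = [sample_fn(prompt) for _ in range(num_samples)]
    majority_answer, _ = Counter(candidates).most_common(1)[0]
    latency = time.time() - start  # grows roughly linearly with num_samples
    return majority_answer, latency
```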
Timeline of Disclosures and Benchmark Development
OpenAI had early access to the FrontierMath benchmark, which it had commissioned Epoch AI to develop.
The dataset comprises 300 advanced math problems, of which OpenAI had access to 250—excluding a 50-question holdout set reserved for unbiased evaluation. This prior access has sparked concerns about how performance claims were interpreted and communicated.
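For readers unfamiliar with holdout evaluation, the sketch below shows the general idea: an unbiased score is computed only on problems the model developer never saw during development. It is a simplified, hypothetical example, not Epoch AI's actual evaluation harness.

```python
# Simplified, hypothetical holdout evaluation; not Epoch AI's actual tooling.
from dataclasses import dataclass

@dataclass
class Problem:
    prompt: str
    answer: str
    in_holdout: bool  # True if the problem was withheld from the developer

def holdout_accuracy(model_answer_fn, problems):
    """Score only on withheld problems to limit the risk of contamination."""
    holdout = [p for p in problems if p.in_holdout]
    if not holdout:
        return 0.0
    correct = sum(model_answer_fn(p.prompt) == p.answer for p in holdout)
    return correct / len(holdout)
```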
Additionally, Epoch AI did not disclose OpenAI’s funding and involvement in creating the benchmark until after the o3 model was announced.
This delayed transparency has led to criticism from the broader AI community, with some contributors to FrontierMath expressing concerns about impartiality.
Industry Reactions
The incident has drawn comparisons to other recent benchmark controversies in the AI sector. For example, xAI faced questions over its presentation of benchmark data for Grok 3, and Meta previously promoted results for a model variant that was not publicly released.
In this context, OpenAI’s o3 rollout is viewed by some as part of a broader issue regarding how AI performance is reported and validated.
AI experts have publicly questioned the claims made around o3. Researcher Gary Marcus criticized the lack of independent validation of the results, drawing parallels to other high-profile cases in tech.
François Chollet, creator of the ARC-AGI benchmark, also challenged OpenAI’s claim that o3 had exceeded human-level performance on his benchmark, stating the model still faces key limitations.
Future Benchmarking Practices
The situation has renewed focus on the need for transparent and standardized benchmarking in the AI industry.
As models grow in complexity and application, ensuring consistency between internal testing conditions and publicly reported results is essential for maintaining trust.
The discrepancies arising from differing model versions, compute configurations, and test subsets highlight the importance of clearly disclosing model variants and evaluation contexts.
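One way to make such disclosure concrete is to attach structured metadata to every reported score. The example below sketches what such a record might contain; the field names and values are illustrative, not an existing standard.

```python
# Illustrative only: the kind of metadata a benchmark result could carry so
# that internal and public figures remain comparable. Not an existing standard.
from dataclasses import dataclass

@dataclass
class BenchmarkReport:
    model_name: str       # e.g. "o3"
    model_variant: str    # e.g. "public release" vs. "pre-release checkpoint"
    compute_tier: str     # e.g. "high test-time compute"
    dataset_version: str  # e.g. "FrontierMath, 290-question revision"
    num_questions: int
    score: float          # fraction of problems solved

report = BenchmarkReport("o3", "public release (April)", "default",
                         "FrontierMath, 290-question revision", 290, 0.10)
```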
With OpenAI planning to release o3-pro, a more advanced variant, and with smaller models like o3-mini-high and o4-mini already surpassing the base o3 on FrontierMath, it appears the current o3 version serves as an interim step rather than a final showcase of capabilities.