In a notable milestone for artificial intelligence, both OpenAI and Google DeepMind have reported that their latest models reached gold-medal-level performance at the 2025 International Math Olympiad (IMO), the prestigious mathematics competition contested by the world's top-performing high school students.
Highlights
- Breakthrough in AI Reasoning: OpenAI and Google DeepMind report solving 5 out of 6 problems from the 2025 International Math Olympiad (IMO), matching gold medalist performance.
- Informal, Natural Language Reasoning: Unlike past symbolic methods, both models reasoned in plain English—signaling a leap in AI’s ability to handle abstract, creative problems.
- Two Philosophies, One Goal: OpenAI used pure LLM chains for inference, while DeepMind used a hybrid symbolic-LLM method—showcasing different strategies to achieve similar outcomes.
- Evaluation Controversy: OpenAI’s evaluation relied on independent reviewers and preceded official IMO grading; DeepMind followed the formal review process, sparking debate about procedure and credibility.
- What Gold Means: Only about 10% of the 630+ human contestants earned gold medals. The AI models matched that top tier of performance in one of the world's most rigorous academic competitions.
- No Public Release Yet: Neither OpenAI nor DeepMind has released its model, citing experimental status. OpenAI suggests public access could take "many months."
- Strategy Over Speed: DeepMind’s emphasis on rigor and transparency reframes the AI race not just as a competition in capability, but also in scientific trust and governance.
Both companies announced that their systems correctly solved five out of six IMO problems, surpassing the performance of most student participants.
Unlike previous efforts that relied on formal symbolic systems requiring human pre-processing, this year's models tackled the problems using "informal" reasoning: they interpreted the natural-language questions directly and generated proof-based answers in plain English.
Informal Reasoning
This shift to informal systems marks a turning point in AI’s reasoning capabilities. Machines have historically struggled with the kind of ambiguity, multi-step logic, and creative thinking required in math olympiads.
Researchers from both companies describe this achievement as a substantial advance in general reasoning, particularly in solving non-verifiable problems that extend beyond traditional math exercises or programming tasks.
Google DeepMind used a hybrid approach that combined formal symbolic logic with large language model reasoning, while OpenAI’s approach relied entirely on LLM-generated reasoning—referred to internally as “pure LLM chains.”
Both methods underscore evolving philosophies in AI model design and hint at future directions for advanced problem-solving AI.
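To make the distinction concrete, the sketch below outlines what the two philosophies might look like in code. Neither lab has published its pipeline, so every name here (call_llm, formal_verifier, the draft-critique-revise loop) is a hypothetical stand-in for illustration, not a description of either company's actual system.

```python
# Conceptual sketch only: the function names and control flow are
# hypothetical placeholders, not either lab's published method.

def call_llm(prompt: str) -> str:
    """Stand-in for a large language model call (hypothetical)."""
    return f"[model output for: {prompt[:40]}...]"

def formal_verifier(formal_statement: str) -> bool:
    """Stand-in for a symbolic proof checker (hypothetical); a real one
    would mechanically validate each step of the argument."""
    return True

def solve_pure_llm(problem: str, max_rounds: int = 3) -> str:
    """'Pure LLM chain': the model drafts, critiques, and revises a
    natural-language proof entirely in plain English."""
    draft = call_llm(f"Write a complete proof: {problem}")
    for _ in range(max_rounds):
        critique = call_llm(f"Find gaps in this proof: {draft}")
        draft = call_llm(f"Revise the proof to address: {critique}")
    return draft  # final answer is an informal, human-readable proof

def solve_hybrid(problem: str) -> str:
    """Hybrid approach: the LLM produces reasoning plus a formal statement,
    and a symbolic component checks it before the answer is accepted."""
    informal = call_llm(f"Reason about: {problem}")
    formal = call_llm(f"Translate the key steps into a formal statement: {informal}")
    if formal_verifier(formal):
        return informal
    return call_llm(f"Retry with a corrected argument: {problem}")
```

The practical difference is where trust comes from: the pure-LLM chain relies on the model's own self-checking, while the hybrid route adds an external, mechanical verification step.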
Differing Approaches and Disputes Over Recognition
While both results are technically impressive, their release sparked debate—not about performance, but about procedure.
OpenAI publicized its achievement shortly after the IMO student awards ceremony, but before undergoing any official grading process sanctioned by the IMO. Instead, the company enlisted three former IMO medalists to independently evaluate its model’s output.
Google DeepMind, in contrast, participated in the official IMO evaluation. The company collaborated directly with the competition’s organizers and waited until the formal grading was completed before sharing its results publicly.
Thang Luong, who leads DeepMind’s math reasoning research, emphasized the importance of adhering to the IMO’s established evaluation standards. According to Luong, “Any evaluation not based on that guideline cannot claim gold-level performance.”
OpenAI has since clarified that it did not initially enter the formal process but chose to contact IMO organizers only after reaching what it believed to be a gold-worthy performance.
While OpenAI states it waited until after the student awards ceremony to make its announcement, some within the AI research community expressed concern over the timing and the transparency of the process.
What Gold Means in the IMO
Out of over 630 student participants this year, only around 10% received gold medals. That AI systems could match this level underscores a rapid acceleration in machine reasoning capabilities.
IMO problems demand creativity, deep abstraction, and long-form logical deduction—skills long thought to be uniquely human.
Unlike structured coding challenges or basic logic puzzles, these problems often require sustained reasoning across multiple conceptual domains, a hallmark of elite human cognition.
Performance and Methodology
- Google DeepMind's Model – A hybrid reasoning engine that blends formal symbolic logic with natural-language outputs (likely based on Gemini Deep Think).
- OpenAI’s Model – Pure LLM-based reasoning, without external symbolic formalization. All logic is derived from generative model chaining.
Both models executed their reasoning processes at test time using substantial computational resources.
OpenAI has not disclosed the compute cost, but its approach appears to have relied heavily on deep inference-time reasoning, pushing the bounds of what’s possible with large-scale models.
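As an illustration of what "deep inference-time reasoning" can mean in practice, the sketch below shows a generic best-of-n pattern from the research literature: spend extra compute at test time by sampling many long reasoning rollouts and keeping the one a grading pass scores highest. This is not OpenAI's or DeepMind's actual setup; all names are hypothetical placeholders.

```python
import random

def sample_candidate(problem: str, seed: int) -> str:
    """Stand-in for one long-form reasoning rollout (hypothetical)."""
    return f"candidate proof #{seed} for: {problem}"

def grade_candidate(candidate: str) -> float:
    """Stand-in for a self-grading or verifier pass (hypothetical)."""
    return random.random()

def best_of_n(problem: str, n: int = 32) -> str:
    """Larger n means more test-time compute and, typically, a better
    chance that at least one rollout is a sound proof."""
    candidates = [sample_candidate(problem, seed) for seed in range(n)]
    return max(candidates, key=grade_candidate)

print(best_of_n("IMO-style problem statement", n=8))
```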
Implications for AI Research and Education
This achievement is being viewed as a precursor to broader applications of AI in mathematics and science.
By demonstrating informal proof generation at a gold-medal level, both labs signal a future where AI could contribute meaningfully to open-ended scientific problems—not just replicate known patterns.
Neither company plans to release these exact models in the near future. OpenAI has suggested it will be “many months” before a public rollout, underscoring the models’ current experimental status.