Google’s Gemini 2.5 Pro, the company’s most advanced large language model to date, has reportedly completed a full playthrough of Pokémon Blue, the classic 1996 Game Boy title.
Highlights
The achievement was not part of an official Google experiment but was instead facilitated by an independent software engineer known online as Joel Z.
Despite being unaffiliated with Google, the project attracted attention from company executives, including CEO Sundar Pichai, who shared news of the completion on X with the post: “What a finish! Gemini 2.5 Pro just completed Pokémon Blue!”
The livestream, titled Gemini Plays Pokémon, documented Gemini’s progress as it navigated through the game using a custom-built interface designed by Joel Z. While not part of a formal research initiative, the project captured the interest of figures at Google AI.
Weeks prior to the game’s completion, Logan Kilpatrick, Google AI Studio’s product lead, noted Gemini’s in-game progress on social media, highlighting that it had earned its fifth badge.
Pichai joined the conversation at the time with a tongue-in-cheek comment: “We are working on API — Artificial Pokémon Intelligence :).”
The use of Pokémon Blue was intentional. Earlier in 2025, Anthropic had shared updates on its Claude model’s attempts to play Pokémon Red, emphasizing the model’s reasoning abilities in complex and unpredictable environments.
Joel Z cited Claude’s progress and the related Claude Plays Pokémon Twitch project as one of the inspirations behind testing Gemini in a similar context.
Direct comparisons between the two projects are of limited value, however. Claude has not yet completed Pokémon Red, and both models rely on custom-built interfaces known as “agent harnesses” to interact with the game.
These systems provide the AI with structured visual and state data from the game, enabling it to interpret scenarios and simulate in-game actions. Each project employs different methods, prompting styles, and tooling, so direct performance comparisons would be misleading.
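Neither harness has been released as a reusable library, but the loop such systems run is broadly similar: capture the current frame and game state, prompt the model, validate its reply, and feed a button press back to the emulator. The sketch below illustrates that loop; every name in it (the emulator wrapper, the model call, the button parser) is hypothetical rather than taken from either project.

```python
# Hypothetical sketch of an agent-harness loop. None of these names come
# from the Gemini or Claude projects; they illustrate the general pattern.
VALID_BUTTONS = {"A", "B", "UP", "DOWN", "LEFT", "RIGHT", "START", "SELECT"}

def parse_button(reply: str) -> str | None:
    """Return the first valid button named in the model's reply, if any."""
    for token in reply.upper().replace(".", " ").split():
        if token in VALID_BUTTONS:
            return token
    return None

def run_harness(emulator, model, max_steps: int = 10_000) -> None:
    """Capture state, ask the model for a move, and apply it to the game."""
    for _ in range(max_steps):
        frame = emulator.capture_screen()       # current rendered frame
        state = emulator.read_game_state()      # e.g. location, party, badges
        reply = model.generate(
            prompt=(
                "You are playing Pokémon Blue. Current state:\n"
                f"{state}\n"
                "Answer with exactly one button: "
                + ", ".join(sorted(VALID_BUTTONS))
            ),
            image=frame,
        )
        button = parse_button(reply)
        if button is not None:
            emulator.press(button)              # simulate the player input
```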
Joel Z was clear that his project should not be viewed as a benchmark of Gemini’s raw performance. “Please don’t consider this a benchmark for how well an LLM can play Pokémon,” he wrote on his Twitch page. “You can’t really make direct comparisons — Gemini and Claude have different tools and receive different information.”
While Gemini did receive some support during the playthrough, Joel Z clarified the nature of his involvement. Developer interventions were used to guide Gemini’s reasoning and planning abilities, not to provide solutions or step-by-step instructions.
One exception was alerting the model to a known in-game bug that required speaking to a Team Rocket Grunt twice to obtain the Lift Key — an issue resolved in later game versions like Pokémon Yellow.
Joel emphasized that such interventions were aimed at improving Gemini’s autonomous decision-making, rather than bypassing challenges.
“I don’t give specific hints,” he said. “My interventions improve Gemini’s overall decision-making and reasoning abilities.” The Gemini Plays Pokémon project remains active and continues to evolve as a testing ground for AI-agent interaction in open-ended environments.
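Joel Z has not published the exact mechanism behind these interventions, but one common pattern is to append a one-time note to the model’s context when a tricky situation is detected, rather than scripting the action itself. A minimal sketch of that idea, with all names illustrative:

```python
# Hypothetical illustration of a developer intervention: a one-time hint
# about a known game bug is appended to the model's context, leaving the
# actual decision-making to the model. All names here are illustrative.
LIFT_KEY_HINT = (
    "Note: due to a bug in Pokémon Blue, the Team Rocket Grunt who drops "
    "the Lift Key must be spoken to twice before the key is obtained."
)

_delivered = set()

def maybe_intervene(context: list, game_state: dict) -> list:
    """Append the Lift Key hint once, the first time the situation arises."""
    if game_state.get("location") == "ROCKET_HIDEOUT" and "lift_key" not in _delivered:
        context.append({"role": "user", "content": LIFT_KEY_HINT})
        _delivered.add("lift_key")
    return context
```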
Multimodal Capabilities in Interactive Tasks
Gemini is designed to process and integrate various types of input, including text, images, audio, video, and code. In the context of Pokémon Blue, it utilized game screenshots and textual overlays to understand the environment and make decisions.
This capability reflects the model’s broader potential to operate in complex, multimodal settings.
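For a concrete sense of what a single multimodal turn looks like, the snippet below sends one screenshot plus a text prompt using Google’s google-generativeai Python SDK. The model identifier and prompt are assumptions for illustration; the actual project’s plumbing has not been published.

```python
# One multimodal turn: a game screenshot plus a text prompt. Uses the
# google-generativeai SDK; the model name and prompt are illustrative.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")          # placeholder key
model = genai.GenerativeModel("gemini-2.5-pro")  # assumed model identifier

frame = Image.open("frame.png")                  # a captured Game Boy frame
response = model.generate_content([
    "This is the current Pokémon Blue screen. Describe what is happening "
    "and pick one button to press: A, B, UP, DOWN, LEFT, RIGHT, or START.",
    frame,
])
print(response.text)
```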
Notable Benchmark Achievements
The Gemini Ultra variant has demonstrated strong results in standardized AI benchmarks. It became the first model to outperform human experts on the Massive Multitask Language Understanding (MMLU) benchmark, scoring 90%.
The benchmark evaluates models across 57 academic and professional subjects, offering insight into Gemini’s wide-ranging reasoning skills.
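As a rough illustration of how such a score is computed, the sketch below grades multiple-choice predictions and macro-averages accuracy across subjects. The aggregation shown is an assumption; published scores may use different evaluation protocols.

```python
# Rough sketch of MMLU-style scoring: grade multiple-choice answers, then
# macro-average accuracy across subjects. The aggregation shown here is an
# assumption; reported scores may use different protocols.
from collections import defaultdict

def score_mmlu(results):
    """results: iterable of (subject, predicted_letter, correct_letter)."""
    per_subject = defaultdict(lambda: [0, 0])   # subject -> [correct, total]
    for subject, predicted, correct in results:
        per_subject[subject][0] += int(predicted == correct)
        per_subject[subject][1] += 1
    return sum(c / t for c, t in per_subject.values()) / len(per_subject)

print(score_mmlu([
    ("college_physics", "B", "B"),
    ("professional_law", "C", "A"),
    ("professional_law", "D", "D"),
]))  # 1.0 for physics, 0.5 for law -> macro average of 0.75
```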
Integration Across Google’s Product Ecosystem
Gemini is already embedded in a variety of Google services. Gemini Pro powers Bard (since rebranded as Gemini), enhancing the assistant’s reasoning and conversational abilities.
Gemini Nano, optimized for on-device use, supports features on Pixel devices, such as “Summarize in Recorder” and “Smart Reply in Gboard.”
Developer Collaboration and Experimentation
The Pokémon project was made possible through the use of a custom agent harness developed by Joel Z, allowing Gemini to interface with the game.
This framework provided game-state awareness and enabled the model to simulate player actions, illustrating how third-party developers can extend LLM capabilities in interactive environments.
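For a concrete sense of what “simulating player actions” involves, the sketch below drives a scriptable emulator, here the open-source PyBoy project’s v2 API. Whether either harness was actually built on PyBoy is an assumption, and the ROM path is a placeholder.

```python
# Sketch of the emulator side of a harness, using the open-source PyBoy
# Game Boy emulator (v2 API). Whether either project actually used PyBoy
# is an assumption, and the ROM path is a placeholder.
from pyboy import PyBoy

pyboy = PyBoy("pokemon_blue.gb", window="null")  # headless emulation

def step(button: str, frames: int = 30):
    """Press a button, advance the game, and return the new frame."""
    pyboy.button(button.lower())   # queue a press-and-release of the button
    pyboy.tick(frames)             # run the emulator forward
    return pyboy.screen.image      # PIL image the model can inspect

frame = step("a")                  # e.g. advance a dialogue box
frame.save("frame.png")
pyboy.stop()
```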
AI in Dynamic, Feedback-Driven Settings
Completing a non-linear game like Pokémon Blue showcases an AI’s capacity for memory, decision-making, and long-term planning — all essential elements for operating in real-world, dynamic environments.
The project demonstrates how language models can be applied beyond traditional chatbot use cases, into simulations that require adaptability and iterative reasoning.
Although the project blends experimentation with entertainment, it underscores a growing trend: large language models are increasingly being tested in interactive settings where decisions must be made over time, under uncertain and evolving conditions.
While games have historically served as benchmarks for AI — from chess and Go to StarCraft — titles like Pokémon Blue add narrative, exploration, and planning complexity that more closely mirror human problem-solving in real-world applications.
Gemini’s successful playthrough of Pokémon Blue offers insight into how large-scale AI models can be creatively applied when paired with custom tools and independent innovation.
Whether this leads to further experimentation in gaming or real-world simulations remains to be seen — but it clearly demonstrates how collaboration between model capabilities and developer frameworks can unlock new potential for AI.