Gemini Deep Think learns math, wins gold medal at International Math Olympiad
3 days ago / About a 15-minute read
Source: Ars Technica
DeepMind followed IMO rules to earn gold, unlike OpenAI.


Credit: Google DeepMind

The students participating in the annual International Math Olympiad (IMO) represent some of the most talented young mathematical minds in the world. This year, they faced down a newly enhanced array of powerful AI models, including Google's Gemini Deep Think. The company says it put its model to the test using the same rules as human participants, and it improved on an already solid showing from last year.

Google says its specially tuned math AI got five of the six questions correct, which is good enough for gold medal status. And unlike OpenAI, Google played by the rules set forth by the IMO.

A new Gemini

The Google DeepMind team participated in last year's IMO competition using an AI composed of the AlphaProof and AlphaGeometry 2 models. This setup was able to get four of the six questions correct, earning silver medal status—only half of the human participants earn any medal at all.

In 2025, Google DeepMind was among a group of companies that worked with the IMO to have their models officially graded and certified by the coordinators. Google came prepared with a new model for the occasion. Gemini Deep Think was announced earlier this year as a more analytical take on simulated reasoning models. Rather than going down one linear line of "thought," Deep Think runs multiple reasoning processes in parallel, integrating and comparing the results before giving a final answer.
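Google hasn't published Deep Think's internals, but the general shape of that parallel-thinking pattern, sampling several independent reasoning paths and then reconciling them, can be sketched in a few lines of Python. Everything below is illustrative: `sample_reasoning` is a toy stand-in for a call to a reasoning model, not a real Gemini API, and the majority vote is only the simplest possible way to combine paths.

```python
import concurrent.futures
import random
from collections import Counter


def sample_reasoning(problem: str, seed: int) -> tuple[str, str]:
    """Toy stand-in for one independent reasoning pass.

    A real system would call a reasoning model here; this stub just
    returns a canned trace and answer so the sketch runs end to end.
    """
    rng = random.Random(seed)
    answer = rng.choice(["4048", "4048", "4047"])  # most toy paths agree
    return f"reasoning trace {seed} for: {problem}", answer


def parallel_think(problem: str, n_paths: int = 8) -> str:
    """Sketch of the parallel-thinking pattern: run several reasoning
    paths at once, then compare their conclusions before answering."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_paths) as pool:
        results = list(pool.map(lambda s: sample_reasoning(problem, s),
                                range(n_paths)))

    # Simplest possible aggregation: majority vote over final answers.
    # A production system would also cross-check and merge the traces.
    votes = Counter(answer for _, answer in results)
    best_answer, _ = votes.most_common(1)[0]
    return best_answer


if __name__ == "__main__":
    print(parallel_think("toy IMO-style problem"))
```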

According to Thang Luong, DeepMind senior scientist and head of the IMO team, this is a paradigm shift from last year's effort. In 2024, an expert had to translate the natural-language questions into a domain-specific language, and that same expert had to interpret the output at the end of the process. Deep Think, by contrast, works in natural language end to end and was not specifically designed to do math.

In the past, making LLMs better at math typically meant reinforcement learning that rewarded only the final answer. Luong explained to Ars that models trained in this way can get to the correct answer, but they have "incomplete reasoning," and part of the IMO grading is based on showing your work. To prepare Deep Think for the IMO, Google used new reinforcement learning techniques with higher-quality "long answer" solutions to mathematical problems, giving the model better grounding in how to handle every step on the way to an answer. "With this kind of training, you can actually get robust, long-form reasoning," said Luong.
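DeepMind hasn't detailed those training techniques, but the difference between the two reward signals can be sketched roughly as follows. Both functions are hypothetical illustrations: one rewards only a matching final answer, while the other grades the written solution step by step, closer in spirit to how IMO proofs are scored.

```python
def outcome_reward(final_answer: str, reference_answer: str) -> float:
    """Older recipe: reward only a matching final answer. A model trained
    this way can land on the right number while skipping justification."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


def long_form_reward(step_is_justified: list[bool]) -> float:
    """Illustrative alternative: grade the whole written solution, giving
    credit in proportion to the steps a (hypothetical, external) grader
    judges fully justified, closer to IMO partial-credit scoring."""
    if not step_is_justified:
        return 0.0
    return sum(step_is_justified) / len(step_is_justified)


# Toy usage: a correct answer backed by shaky reasoning scores well under
# the first signal but poorly under the second.
print(outcome_reward("42", "42"))                    # 1.0
print(long_form_reward([True, False, False, True]))  # 0.5
```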


Credit: Google DeepMind

As you might expect, Deep Think takes more time to generate an output compared to the simpler versions you can access in the Gemini app. However, the AI followed the same rules as the flesh-and-blood participants, which was only possible because of its ability to ingest the problems as natural language. Gemini was provided with the problem descriptions and gave its answers within the 4.5-hour time limit of the competition.

Rigorous proofs

AI firms like DeepMind have taken an interest in the IMO over the past few years because it presents a unique challenge. While the competition is aimed at pre-university mathematicians, the questions require critical thinking and an understanding of multiple mathematical disciplines, including algebra, combinatorics, geometry, and number theory. Only the most advanced AI models have any hope of accurately answering these multi-layered problems.

The DeepMind team has pointed out some interesting aspects of Deep Think's performance, which they say come from its advanced training. In the third problem (below), for example, many human competitors applied a graduate-level concept called Dirichlet's Theorem, using mathematics outside the intended scope of the competition. However, Deep Think recognized that it was possible to solve the problem with simpler math. "Our model actually made a brilliant observation and used only elementary number theory to create a self-contained proof of the given problem," said DeepMind researcher and Brown University professor Junehyuk Jung.

The DeepMind team says the model came up with a "brilliant" solution to this problem.
Credit: Google DeepMind
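For readers unfamiliar with the reference, the graduate-level tool many contestants reached for is Dirichlet's theorem on arithmetic progressions. Its standard statement (given here for context, not drawn from the competition materials) is:

```latex
% Dirichlet's theorem on arithmetic progressions (standard statement)
\textbf{Theorem (Dirichlet).} If $a$ and $d$ are coprime positive integers, then the
arithmetic progression $a,\; a + d,\; a + 2d,\; \ldots$ contains infinitely many primes.
```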

As for the one Deep Think got wrong, the team says that was objectively the hardest of the competition. The question asked about the minimum number of rectangles needed to cover a given space. Jung explains that Deep Think started from an incorrect hypothesis, believing that the answer would be greater than or equal to 10, so it was lost from the start. "There's no way it's going to solve it because that is not true to begin with," said Jung.

So Deep Think lost points on that problem, but Jung notes that only five students managed to get that one right. Still, Google got 35 points to earn a gold medal. Only about 8 percent of the human participants can reach that level.

Google stresses that Deep Think went through the same evaluation as the students did. OpenAI has also announced results from the IMO, but it did not work with the organization to adhere to the established process. Instead, it had a panel of former IMO participants grade its answers and awarded itself a gold medal.

"We confirmed with the IMO organization that we actually solved five perfectly," said Luong. "I think anyone who didn't go through that process, we don't know, they might have lost one point and gotten silver."

Google says the version of Deep Think tuned for the IMO is sticking around. It is currently being rolled out to a group of trusted testers that includes mathematicians. Eventually, this model will be provided to Google AI Ultra subscribers, who pay $250 per month for access to Google's biggest and most expensive models. DeepMind plans to continue iterating on this model and will be back next year in search of a perfect score.