The Three Horsemen
Circa December, 2025Contributors: Nehal Patel (G-Research)
Introduction
The Three Horsemen, Anthropic’s claude code, OpenAI’s codex, Google’s gemini, have completely
changed how we
Putnam
Let’s start with a concrete datapoint from PutnamBench. Each AI agent is invoked directly
Replace the sorry in 'solutions_replaced_new/putnam_1965_a5_sol.lean' with a valid proof.
Compile the file using `lake env lean solutions_replaced_new/putnam_1965_a5_sol.lean` to
make sure your changes are valid, does not produce errors, and that all sorries are
removed. If you run `lake build` on the entire project, you will definitely see that
many other files have sorries. This is ok.
<important>
You are NOT allowed to introduce
axioms or modify any files other than 'solutions_replaced_new/putnam_1965_a5_sol.lean'.
You are not allowed to change any supporting definitions. You may only complete the proof for the
theorem stated in `solutions_replaced_new/putnam_1965_a5_sol.lean'.
</important>
and an AGENTS.md. Solutions are verified with SafeVerify.
TODO: add gemini
We will also explore ablations (e.g. no AGENTS.md) and perturbations (using a Lean-aware MCP servers, claude skills, and providing a hint in the form of an AI-generated natural language proof). For Putnambench, the best recipe finds 213 proofs, and across all recipes we have come across solutions for 220 distinct problems. Circa Dec 2025, these results would merit third place on the PutnamBench Leaderboard.

Caveats
Collecting these results is a somewhat slow and expensive process, and so we assert the prerogative granted by the blog format to share results as they trickle in. Check back, and we will also post on the Lean Zulip. The approaches we explore are decidely simple and the code to run all of this is mostly plumbing, but we hope to share some relatively tidy version of it after the holidays in early 2026.
Putnam 1962-63 Microscope
The following interactive table allows you to explore the results from various agents on PutnamBench problems from 1962 and 1963. You can filter by agent, whether the solution was SafeVerified, and whether the solution compiled without “sorry” placeholders.
Use the slider below to browse through the filtered results.
SafeVerify?:
Duration:
Check Duration:
Tools Calls Summary
Here is a summary of the tools called by the agent to solve this problem.
Use the slider below to browse through the bash commands executed by the agent on this run.
Output:
Pick-50 Variations
Sample 50 problems from PutnamBench, Formal-IMO, FATE, and MiniF2F and evaluate various agents and configurations.
TODO:
- add details
- clean up table
Chart
TODO: cleanup
Cheers
If you have questions, comments, corrections or would like to collaborate, please don’t hesitate to reach out.
TODO
-
Gemini
- add more numbers once I work around rate limits
- current limit: gemini-3.0-pro -> 6 problems a day , 2.5-pro -> ~100/day. Submitted request for Tier increase
-
Putnam
- add crossfilter
- graph vs decade
- graph vs problem difficulty (1-6)
-
Timeout stats
-
Tool call stats
-
NLP proof hint details
-
Safeverify stats
-
Ablations
- no AGENTS.md
- with MCP server
- with NL proof hint
- Claude skills
- main results
- all of these make very little difference
- claude seems to use the skill only on ~10% of the problems
- claude barely uses MCP server, codex uses it more
-
More Datasets
- PutnamLike
- Combibench
- ???
-
Conclusions
- the point is not the PutnamBench score, but the simplicity of jsut using these agents off-the-shelf
-
Things to ponder
- who owns the loop? (e.g)