The Three Horsemen

Circa December, 2025

Contributors: Nehal Patel (G-Research)

Introduction

The Three Horsemen, Anthropic’s claude code, OpenAI’s codex, Google’s gemini, have completely changed how we build things. In this post, we blindly obey “The Law of the Instrument” and investigate if these tools can prove theorems from PutnamBench, formalimo, FATE, and MiniF2F using Lean (for an introduction to commonly used datasets for evaluating AI theorem proving, see our datasets post).

Putnam

Let’s start with a concrete datapoint from PutnamBench. Each AI agent is invoked directly with a simple prompt, e.g.:

Replace the sorry in 'solutions_replaced_new/putnam_1965_a5_sol.lean' with a valid proof.
Compile the file using `lake env lean solutions_replaced_new/putnam_1965_a5_sol.lean` to
make sure your changes are valid, does not produce errors, and that all sorries are 
removed.  If you  run `lake build` on the entire project, you will definitely see that 
many other files have sorries.  This is ok.  
<important>
You are NOT allowed to introduce 
axioms or modify any files other than 'solutions_replaced_new/putnam_1965_a5_sol.lean'.  
You are not allowed to change any supporting definitions.  You may only complete the proof for the
theorem stated in `solutions_replaced_new/putnam_1965_a5_sol.lean'.
</important>

and an AGENTS.md. Solutions are verified with SafeVerify.

TODO: add gemini

We will also explore ablations (e.g. no AGENTS.md) and perturbations (using a Lean-aware MCP servers, claude skills, and providing a hint in the form of an AI-generated natural language proof). For Putnambench, the best recipe finds 213 proofs, and across all recipes we have come across solutions for 220 distinct problems. Circa Dec 2025, these results would merit third place on the PutnamBench Leaderboard.

Caveats

Collecting these results is a somewhat slow and expensive process, and so we assert the prerogative granted by the blog format to share results as they trickle in. Check back, and we will also post on the Lean Zulip. The approaches we explore are decidely simple and the code to run all of this is mostly plumbing, but we hope to share some relatively tidy version of it after the holidays in early 2026.

Putnam 1962-63 Microscope

The following interactive table allows you to explore the results from various agents on PutnamBench problems from 1962 and 1963. You can filter by agent, whether the solution was SafeVerified, and whether the solution compiled without “sorry” placeholders.

Use the slider below to browse through the filtered results.

| 🤖

SafeVerify?: | Compiles Sorry Free?: | No Timeout?:

Duration: minutes | Tools Calls: | Command Duration: seconds

Check Duration: seconds | Verify Duration: seconds

Tools Calls Summary

Here is a summary of the tools called by the agent to solve this problem.

Use the slider below to browse through the bash commands executed by the agent on this run.

seconds

Output:

Pick-50 Variations

Sample 50 problems from PutnamBench, Formal-IMO, FATE, and MiniF2F and evaluate various agents and configurations.

TODO:

add details
clean up table

Gemini
- add more numbers once I work around rate limits
- current limit: gemini-3.0-pro -> 6 problems a day , 2.5-pro -> ~100/day. Submitted request for Tier increase
Putnam
- add crossfilter
- graph vs decade
- graph vs problem difficulty (1-6)
Timeout stats
Tool call stats
NLP proof hint details
Safeverify stats
Ablations
- no AGENTS.md
- with MCP server
- with NL proof hint
- Claude skills
- main results
  - all of these make very little difference
  - claude seems to use the skill only on ~10% of the problems
  - claude barely uses MCP server, codex uses it more
More Datasets
- PutnamLike
- Combibench
- ???
Conclusions
- the point is not the PutnamBench score, but the simplicity of jsut using these agents off-the-shelf
Things to ponder
- who owns the loop? (e.g)