Lean Repl Cryogenics

Circa November, 2025

Contributors: Nehal Patel (G-Research)

Introduction

Commonly used Lean 4 Read-Eval-Print-Loops (REPL’s) such as the Lean Community repl (“repl” henceforth) and Pantograph provide machine-to-machine interfaces for incremental, but back-trackable, theorem proving. Endeavors such a Morph offer branch, clone, rewind at enterprise scale.

It occurred to me that cloning running REPL’s with CRIU might be a good middle ground between the checkpointing capabilities built into repl on the one hand and Morph on the other.

This post pulls that thread…

Let’s first start with an existence proof, which demonstrates

launching a repl (lake env .../bin/repl)
executing repl commands that import Mathlib and define a theorem with a sorry
“freezing” the repl process to disk (without having to kill it) pdum-criu shell freeze --dir ... --pid ...
rematerializing (“thawing”) the frozen image into an independent, running process: pdum-criu shell thaw --dir ...
performing a “freeze & thaw” in a single step (pdum-criu shell beam ...)

Prelude

repl let’s one enter JSON-formatted commands one-by-one and examining the internal Lean 4 state along the way, much in the way one can do in VS Code and the Lean InfoView. repl uses persistent data structures that allow for efficiently applying commands not to just the latest state, but also to prior states. This allows the user to quickly rewind or to create trees of state from which the user may explore multiple proof strategies. repl also provides facilities for pickling state to disk. This pickled state may then be loaded by multiple new repl’s enabling parallel search starting from a common state.

CRIU is a Linux system utility that allows users to leverage kernel capabilities to checkpoint a process tree and restore it later. Filled with Russian wizardry, it provides primitives for handling file descriptors, sockets, ttys, and multi-threading. We will explore using CRIU to provide an alternate mechanism to repl’s pickle.

CRIU is only going to work on Linux

habemus-papadum-criu is a high-level library and CLI that wraps CRIU to simplify common workflows for freezing and thawing running processes.

`pickle` vs. CRIU

We will compare checkpointing and restoring repl state to disk, first measuring the performance of the pickle command built into repl and then using CRIU’s checkpointing capabilities. A useful example to keep in mind is a scenario where one has evolved a proof to a certain state and now would like to efficiently spawn multiple copies of the repl from that state to explore search paths in parallel.

repl’s `pickle`

Here are the raw timings for pickle:

Command	Time (seconds)
`{"cmd":"import Mathlib\nopen BigOperators\nopen Real\nopen Nat"}`	3.155
`{"pickleTo":"pickle.olean","env":0}`	0.001
`{"unpickleEnvFrom":"pickle.olean"}`	2.367

The size of pickle.olean is 1.5 KB.

Code

Timing code:

#!/usr/bin/env python3
import contextlib
import json
import subprocess
import sys
import time

REPL_CMD = ["lake", "env", "../../.lake/build/bin/repl"]


def read_blocks():
    chunk = []
    for line in sys.stdin:
        if line.strip() == "":
            if chunk:
                yield "".join(chunk)
                chunk = []
            continue
        chunk.append(line)
    if chunk:
        yield "".join(chunk)


def main():
    proc = subprocess.Popen(
        REPL_CMD, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    )
    try:
        for block in read_blocks():
            text = block.rstrip("\n")
            if not text:
                continue
            obj = json.loads(text)
            flat = json.dumps(obj, separators=(",", ":"))
            start = time.perf_counter()
            proc.stdin.write(text + "\n\n")
            proc.stdin.flush()
            while True:
                line = proc.stdout.readline()
                if line == "":
                    raise RuntimeError("REPL terminated")
                if line.strip() == "":
                    break
            elapsed = time.perf_counter() - start
            print(f"{elapsed:.3f} {flat}", flush=True)
    finally:
        if proc.stdin:
            proc.stdin.close()
        with contextlib.suppress(ProcessLookupError):
            proc.terminate()
        with contextlib.suppress(subprocess.TimeoutExpired, ProcessLookupError):
            proc.wait(timeout=5)


if __name__ == "__main__":
    import os  
    os.chdir("repl/test/Mathlib")
    main()

CRIU

Here are the timings for CRIU:

Step	Time (seconds)
import Mathlib	3.211
freeze to disk	0.829
thaw from disk	0.510

Image size: 455 MB

Code

#!/usr/bin/env python3
"""Launch, freeze, and thaw a goblin to measure restore latency."""

from __future__ import annotations

import os
import shlex
import shutil
import subprocess
import sys
import time
from pathlib import Path

from pdum.criu import goblins

EXECUTABLE = ["lake", "env", str(Path("repl/.lake/build/bin/repl").resolve())]
WORKDIR = Path("repl/test/Mathlib").resolve()
IMAGES_DIR = Path("/tmp/time-demo-image")
THAW_PIDFILE = IMAGES_DIR / "thaw.pid"
THAW_LOG = IMAGES_DIR / "thaw.log"
PRIME_COMMAND = '{"cmd": "import Mathlib\\nopen BigOperators\\nopen Real\\nopen Nat"}'
PAYLOAD_TEXT = '{"cmd": "def f := 37"}' #there is no way to know when that is done, so we send a simple command and wait for the reponse


def _launch_process(command: list[str], workdir: Path) -> subprocess.Popen[bytes]:
    """Launch the target process with pipe-based stdio."""

    print(f"Launching {command} (cwd={workdir})")
    env = {**os.environ, "UV_USE_IO_URING": "0"}
    proc = subprocess.Popen(
        command,
        cwd=str(workdir),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        start_new_session=True,
        env=env,
    )
    if proc.stdin is None or proc.stdout is None or proc.stderr is None:
        raise RuntimeError("failed to capture stdio pipes for the target process")
    return proc


def _send_command_read_response(
    writer,
    reader,
    command: str,
    *,
    stage: str,
) -> list[str]:
    """Write ``command`` and collect stdout until a blank line terminates the response."""

    if writer is None or reader is None:
        raise RuntimeError(f"{stage} pipes are not available")

    payload = (command.rstrip("\n") + "\n\n").encode("utf-8")
    print(f"[{stage}] Sending payload: {command!r}")
    writer.write(payload)
    writer.flush()

    lines: list[str] = []
    while True:
        raw = reader.readline()
        if not raw:
            raise RuntimeError(f"{stage} stdout closed before emitting a blank line")
        line = raw.decode("utf-8", errors="replace").rstrip("\n")
        print(f"[{stage}] Response: {line}")
        if line == "":
            break
        lines.append(line)
    return lines


def _prime_process(proc: subprocess.Popen[bytes], command: str) -> None:
    """Send a priming command and ensure the process responds fully."""

    _send_command_read_response(proc.stdin, proc.stdout, command, stage="prime")


def _freeze_process(proc: subprocess.Popen[bytes]) -> Path:
    """Freeze the launched process into the requested images directory."""

    print(f"Freezing PID {proc.pid} into {IMAGES_DIR}")
    log_path = goblins.freeze(
        proc.pid,
        IMAGES_DIR,
        leave_running=False,
        shell_job=False,
    )
    print(f"Freeze complete (log: {log_path})")
    return log_path


def _cleanup_process(proc: subprocess.Popen[bytes]) -> None:
    """Best-effort termination of the original process."""

    for stream in (proc.stdin, proc.stdout, proc.stderr):
        if stream is None:
            continue
        try:
            stream.close()
        except Exception:
            pass
    try:
        proc.terminate()
    except OSError:
        return
    try:
        proc.wait(timeout=5)
    except Exception:
        pass


def prepare_images(
    executable: list[str],
    workdir: Path,
    command: str,
) -> tuple[Path, float, float]:
    """Launch the target, issue a command, and freeze its state."""

    start_launch = time.perf_counter()
    proc = _launch_process(executable, workdir)
    try:
        _prime_process(proc, command)
        prime_done = time.perf_counter()
        launch_prime_elapsed = prime_done - start_launch

        freeze_start = time.perf_counter()
        log_path = _freeze_process(proc)
        freeze_elapsed = time.perf_counter() - freeze_start
        return log_path, launch_prime_elapsed, freeze_elapsed
    finally:
        _cleanup_process(proc)


def measure_thaw(images_dir: Path, message: str) -> float:
    """Thaw a goblin, send a message, and time until a blank line appears."""

    start = time.perf_counter()
    pidfile = THAW_PIDFILE
    log_path = THAW_LOG
    if pidfile.exists():
        pidfile.unlink()
    if log_path.exists():
        log_path.unlink()
    goblin = goblins.thaw(
        images_dir,
        shell_job=False,
        detach=True,
        pidfile=pidfile,
        log_path=log_path,
    )
    #print(f"Waiting for PID file {pidfile}")
    #pid = _wait_for_pidfile(pidfile)
    #print(f"Thawed goblin PID: {pid} (log: {log_path})")
    try:
        _send_command_read_response(goblin.stdin, goblin.stdout, message, stage="thaw")
    finally:
        goblin.close()
    return time.perf_counter() - start


def _wait_for_pidfile(pidfile: Path, timeout: float = 10.0) -> int:
    """Poll ``pidfile`` until CRIU writes the restored PID."""

    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            data = pidfile.read_text().strip()
        except FileNotFoundError:
            time.sleep(0.05)
            continue
        if data:
            return int(data)
        time.sleep(0.05)
    raise TimeoutError(f"timed out waiting for pidfile {pidfile}")


def main() -> None:
    images_dir = IMAGES_DIR
    executable = EXECUTABLE
    workdir = WORKDIR.expanduser().resolve()

    try:
        if images_dir.exists():
            print(f"Removing existing images directory {images_dir}")
            shutil.rmtree(images_dir)
        images_dir.mkdir(parents=True, exist_ok=True)
    except OSError as exc:
        print(f"error: failed to create images directory: {exc}", file=sys.stderr)
        raise SystemExit(1) from exc

    print(f"Preparing goblin checkpoint in {images_dir}")
    try:
        log_path, launch_prime_elapsed, freeze_elapsed = prepare_images(executable, workdir, PRIME_COMMAND)
        print(f"Startup + prime elapsed: {launch_prime_elapsed:.3f}s")
        print(f"Freeze log written to {log_path}")
        print(f"Freeze duration: {freeze_elapsed:.3f}s")

        print(f"Thawing goblin from {images_dir}")
        thaw_elapsed = measure_thaw(images_dir, PAYLOAD_TEXT)
        print(f"Thaw + response elapsed: {thaw_elapsed:.3f}s")
    except Exception as exc:  # pragma: no cover - demo CLI
        print(f"error: {exc}", file=sys.stderr)
        raise SystemExit(1) from exc
    finally:
        if images_dir.exists():
            print(f"Images directory usage:")
            os.system(f"du -sh {shlex.quote(str(images_dir))}")
        else:
            print(f"Images directory {images_dir} does not exist; skipping usage report")

if __name__ == "__main__":
    main()

To freeze repl with CRIU, it must be launched with UV_USE_IO_URING=0 (cf. #2131)

The unpickle / thaw timings demonstrate the major differences between the two approaches. unpickling esssentially replays the prior the session commands while CRIU is loading pages of memory from disk. This simple example, which only imports Mathlib, is not necessarily indicative of realistic scenarios, but this post aims mostly to demonstrate CRIU so we will leave it here for now.

So what?

Not too much, to be honest. CRIU is certainly a fun parlor trick to keep in one’s grimoire, and it is quite nice to see that repl and the Lean4 machinery are largely compatible with its approach (just don’t forget UV_USE_IO_URING=0). It’s possible there are scenarios and / or customized point applications that might benefit from this technique. But it has also been my (admittedly limited) experience that the bitter lesson for AI & Lean is that elegant and efficient MCTS gets trounced by simply teaching codex/claude-code how to use echo "import Mathlib\n..." | lake env lean --stdin and that Mathlib lives in .lake/packages so it can grep away.

And also…

CRIU has many advanced features: for instance page, servers and methods for creating efficient incremental snapshots. Using the habemus-papadum-criu library is always going to be a Faustian bargain. Feel free to use it to “have a play,” but make sure you graduate onto learning the details if you decide to use CRIU for anything substantial.

Cheers

If you have questions, comments, corrections or would like to collaborate, please don’t hesitate to reach out.