We are excited to announce the release of StepFun-Prover-Preview-7B and StepFun-Prover-Preview-32B, large language models designed for formal theorem proving through tool-integrated reasoning. Trained with a reinforcement learning pipeline that incorporates tool-based interactions, StepFun-Prover-Preview achieves strong performance in generating Lean 4 proofs with minimal sampling. Our approach enables the model to emulate human-like problem-solving strategies by iteratively refining proofs based on real-time environment feedback. On the miniF2F-test benchmark, StepFun-Prover-Preview-32B achieves a pass@1 success rate of .
Our key contributions include:

Prior to the advent of reasoning models [10, 11, 6, 12, 13], formal theorem provers primarily employed LLMs either to generate individual proof steps [14, 15, 16, 17, 18, 19] or to directly produce complete proofs [19, 20, 21, 22, 23]. To improve proof search exploration, LLM-based approaches are often combined with traditional tree search methods such as Best-First Search [15, 17, 18] and Monte Carlo Tree Search [14, 16, 19]. Central to these systems is a search algorithm that navigates proof paths via heuristics, offering improved exploration at the expense of increased computational overhead.
More recent approaches [1, 3, 5, 24] explore informal reasoning combined with tool interaction, allowing models to revise and improve their outputs based on Lean 4 feedback. However, most of these methods use environment feedback only at test time, without learning from it.
Some approaches, such as Kimina-Prover, have begun to incorporate REPL error signals. However, they usually permit only limited use of such feedback (e.g., a single retry), rather than allowing the model to interact freely with the verifier until it believes the proof is complete.
Moreover, the marginal gain from increasing sampling attempts (e.g., only accuracy boost from to samples [2]) suggests inefficient error recovery and persistent failure modes. In contrast, human users can resolve errors in far fewer steps by adaptively leveraging verifier feedback. This raises a key question:
In this work, we explore this question by enabling open-ended, tool-assisted reasoning. Our system allows the model to decide when to re-invoke Lean 4 and when to stop reasoning, supporting reflective, adaptive proof generation without constraining the number of Lean 4 interactions.
In this section, we present our end-to-end training pipeline (see Figure 2) for developing tool-integrated reasoning models.

We first compile existing open-source formal mathematics datasets. Additionally, we employ Kimina-Autoformalizer-7B [1] to autoformalize selected problems from the Numina Math 1.5 dataset. However, our analysis reveals that both the open-source formal problems and those generated via autoformalization exhibit several recurring issues, such as tautological statements, contradictory premises, and irrelevant conclusions. To ensure data quality, we employ DeepSeek-V3 [7] to filter out low-quality problems.
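For illustration, the constructed Lean 4 statements below (not drawn from our dataset; the theorem names and numbers are invented) show two of these failure modes, a tautological statement and contradictory premises:

```lean
import Mathlib

-- Illustrative, hand-written examples of low-quality formalizations.

-- Tautological statement: the goal merely restates a hypothesis,
-- so proving it carries no mathematical content.
theorem tautology_example (a b : ℕ) (h : a + b = 10) : a + b = 10 := h

-- Contradictory premises: h₁ and h₂ cannot both hold, so any conclusion
-- follows and the original problem is not faithfully captured.
theorem contradiction_example (x : ℝ) (h₁ : x > 1) (h₂ : x < 0) : x = 37 := by
  linarith
```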
Subsequently, we construct a cold start dataset through two parallel processes:
We adopt the following two-stage training approach:
While the cold-started model demonstrated basic environment-interaction capabilities, it had difficulty effectively integrating the distinct response patterns of the Claude and Kimina styles. To address this challenge, we developed a novel fusion approach:
During the reinforcement learning stage, our rollout process integrates the collaboration of an LLM with a sandbox-style environment, enabling the production of hybrid content that combines text, Lean 4 code snippets enclosed in dedicated code tags, and real-time environment feedback enclosed in feedback tags. Concretely, generation starts with text-based reasoning, after which the model proposes its first Lean 4 solution to the problem. When a sketch termination trigger is detected, generation pauses and the Lean 4 solution is parsed and sent to the sandbox environment for execution. Upon completion, the sandbox's output (successful results or error messages) is wrapped in feedback tags and returned to the model, which continues generating the rollout until it either provides a final answer or produces a new Lean 4 solution, ultimately yielding a hybrid reasoning trajectory.
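The loop below is a minimal sketch of this rollout procedure. The delimiter strings and the helpers `generate_until`, `extract_last_lean_snippet`, and `run_lean_repl` are hypothetical placeholders, not the actual tokens or infrastructure used in training:

```python
# Minimal sketch of the tool-integrated rollout loop (placeholder names throughout).

SKETCH_END = "</sketch>"          # hypothetical trigger that closes a Lean 4 snippet
FEEDBACK_TAG = "<REPL>{}</REPL>"  # hypothetical wrapper for sandbox feedback
FINAL_MARKER = "<final_answer>"   # hypothetical final-answer delimiter

def tool_integrated_rollout(model, prompt, max_rounds=16):
    """Alternate between model generation and Lean 4 sandbox execution."""
    trajectory = prompt
    for _ in range(max_rounds):
        # Generate until the model either declares a final answer or closes
        # a Lean 4 sketch (the returned chunk includes the stop string).
        chunk = generate_until(model, trajectory, stop=[SKETCH_END, FINAL_MARKER])
        trajectory += chunk
        if FINAL_MARKER in chunk:
            break  # the model believes the proof is complete
        # Run the most recent Lean 4 snippet in the sandbox and feed the
        # output (success message or errors) back into the context.
        lean_code = extract_last_lean_snippet(trajectory)
        feedback = run_lean_repl(lean_code)
        trajectory += FEEDBACK_TAG.format(feedback)
    return trajectory
```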
When the model believes it has reached a final answer, it is required to generate a special delimiter and place its final Lean 4 solution after this marker, enabling reliable rule-based verification. We evaluate the correctness of the code segment following this delimiter to assign the trajectory reward: if the proof passes REPL verification, the reward is 1; otherwise, it is 0. We then perform reinforcement learning from the instruction-tuned model using GRPO, based on this tool-integrated rollout and reward setup.
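The reward rule itself is simple enough to express directly; the sketch below assumes the same hypothetical final-answer marker as above and a `verify_with_repl` helper that wraps Lean 4 REPL verification:

```python
FINAL_MARKER = "<final_answer>"  # placeholder for the special delimiter

def trajectory_reward(trajectory: str) -> float:
    """Binary trajectory reward: 1 if the final Lean 4 code passes REPL verification."""
    if FINAL_MARKER not in trajectory:
        return 0.0                        # no final answer was produced
    final_code = trajectory.split(FINAL_MARKER)[-1]
    ok = verify_with_repl(final_code)     # hypothetical wrapper around the Lean 4 REPL
    return 1.0 if ok else 0.0
```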

Figure 3: Training reward and test accuracy (pass@1) of the tool-integrated GRPO.
Upon convergence of the tool-integrated reinforcement learning, we collect correct trajectories from problems scoring below 0.5 accuracy during training. These trajectories are then filtered to retain only those demonstrating comprehensive REPL feedback analysis. The selected trajectories subsequently undergo another round of supervised fine-tuning. Finally, we implement a secondary reinforcement learning phase to further enhance the model’s tool-integrated reasoning capabilities.
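As a rough sketch of the selection step, assume each problem records its accuracy during RL training and each correct trajectory carries a flag indicating whether it demonstrates REPL feedback analysis (how that flag is computed is left abstract, and all field names below are illustrative):

```python
def select_for_second_sft(problems, accuracy_threshold=0.5):
    """Keep correct trajectories of hard problems that analyse REPL feedback."""
    selected = []
    for problem in problems:
        if problem["train_accuracy"] >= accuracy_threshold:
            continue  # keep only problems the model struggled with during RL
        for traj in problem["correct_trajectories"]:
            if traj["analyzes_repl_feedback"]:
                selected.append(traj)
    return selected
```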
We evaluate our models on Numina’s corrected miniF2F benchmark from their HuggingFace repository. StepFun-Prover-Preview-32B achieves pass@1, surpassing prior SOTA results with a smaller model size, as shown in Table 1. This demonstrates the effectiveness of our tool-integrated reasoning approach in low-pass settings.
| Model | MiniF2F-test (Pass@1) |
|---|---|
| DeepSeek-Prover-V2-7B | |
| DeepSeek-Prover-V2-671B | |
| Kimina-Prover-8B | |
| Kimina-Prover-72B | |
| StepFun-Prover-Preview-7B | |
| StepFun-Prover-Preview-32B | |
Table 1: Performance comparison of theorem-proving models on the miniF2F-test dataset under minimal sampling budgets. For the StepFun-Prover models, we generate 32 responses per query to estimate Pass@1.
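For reference, if pass@1 is estimated with the standard unbiased pass@k estimator (an assumption on our part; for k = 1 it reduces to the fraction of verified samples), the computation from n = 32 samples looks like this, with the counts in the example invented:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with invented counts: 24 of 32 responses verify -> pass@1 = 24/32 = 0.75.
print(pass_at_k(n=32, c=24, k=1))
```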
Tool-integrated reasoning models scale performance with test-time compute. As shown in Table 2, StepFun-Prover-Preview-32B shows consistent performance improvements on the miniF2F-test benchmark when we extend the maximum generation length.
| Maximum generation length | MiniF2F-test (Pass@1) |
|---|---|
| 4096 | |
| 8192 | |
| 12288 | |
| 16384 | |
| 20480 |
Table 2: Performance of StepFun-Prover-Preview-32B on miniF2F-test with various maximum generation lengths. Longer responses yield better results.
As shown in Figure 4, the distribution of REPL interactions for successful proofs on the miniF2F-test benchmark (using StepFun-Prover-Preview-32B with 32 responses per problem) reveals StepFun-Prover-Preview's robust ability to solve problems through extended interaction sequences during the reasoning process, demonstrating that correct solutions can emerge after multiple rounds of refinement.

Figure 4: REPL interaction frequency distribution.
Through our tool-integrated reinforcement learning framework with cold start, StepFun-Prover-Preview develops several advanced reasoning capabilities that leverage real-time environment feedback. Here are some StepFun-Prover outputs demonstrating its diverse reasoning patterns.