We are excited to announce the release of StepFun-Prover-Preview-7B and StepFun-Prover-Preview-32B, large language models designed for formal theorem proving through tool-integrated reasoning. Trained with a reinforcement learning pipeline that incorporates tool-based interactions, StepFun-Prover-Preview achieves strong performance in generating Lean 4 proofs with minimal sampling. Our approach enables the model to emulate human-like problem-solving strategies by iteratively refining proofs based on real-time environment feedback. On the miniF2F-test benchmark, StepFun-Prover-Preview-32B achieves a pass@1 success rate of .
Our key contributions include:

Prior to the advent of reasoning models [10, 11, 6, 12, 13], formal theorem provers primarily employed LLMs either to generate individual proof steps [14, 15, 16, 17, 18, 19] or to directly produce complete proofs [19, 20, 21, 22, 23]. To improve proof search exploration, LLM-based approaches are often combined with traditional tree search methods such as Best-First Search [15, 17, 18] and Monte Carlo Tree Search [14, 16, 19]. Central to these systems is a search algorithm that navigates proof paths via heuristics, offering improved exploration at the expense of increased computational overhead.
More recent approaches [1, 3, 5, 24] explore informal reasoning combined with tool interaction, allowing models to revise and improve their outputs based on Lean 4 feedback. However, most of these methods use environment feedback only at test time, without learning from it.
Some approaches, such as Kimina-Prover, have begun to incorporate REPL error signals. However, they usually permit only limited use of such feedback (e.g., a single retry), rather than allowing the model to interact freely with the verifier until it believes the proof is complete.
Moreover, the marginal gain from increasing sampling attempts (e.g., only accuracy boost from to samples [2]) suggests inefficient error recovery and persistent failure modes. In contrast, human users can resolve errors in far fewer steps by adaptively leveraging verifier feedback. This raises a key question:
In this work, we explore this question by enabling open-ended, tool-assisted reasoning. Our system allows the model to decide when to re-invoke Lean 4 and when to stop reasoning, supporting reflective, adaptive proof generation without constraining the number of Lean 4 interactions.
In this section, we present our end-to-end training pipeline (see Figure 2) for developing tool-integrated reasoning models.

We first compile existing open-source formal mathematics datasets. Additionally, we employ Kimina-Autoformalizer-7B [1] to autoformalize selected problems from the Numina Math 1.5 dataset. However, our analysis reveals that both the open-source formal problems and those generated via autoformalization exhibit several recurring issues, such as tautological statements, contradictory premises, and irrelevant conclusions. To ensure data quality, we employ DeepSeek-V3 [7] to filter out low-quality problems.
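For illustration, the constructed Lean 4 statements below (not drawn from our dataset; the theorem names and numbers are invented) show two of these failure modes, a tautological statement and contradictory premises:

```lean
import Mathlib

-- Illustrative, hand-written examples of low-quality formalizations.

-- Tautological statement: the goal merely restates a hypothesis,
-- so proving it carries no mathematical content.
theorem tautology_example (a b : ℕ) (h : a + b = 10) : a + b = 10 := h

-- Contradictory premises: h₁ and h₂ cannot both hold, so any conclusion
-- follows and the original problem is not faithfully captured.
theorem contradiction_example (x : ℝ) (h₁ : x > 1) (h₂ : x < 0) : x = 37 := by
  linarith
```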
Subsequently, we construct a cold start dataset through two parallel processes:
We adopt the following two-stage training approach:
While the cold-started model demonstrated basic environment-interaction capabilities, it had difficulty effectively integrating the distinct response patterns of the Claude and Kimina styles. To address this challenge, we developed a novel fusion approach:
During the reinforcement learning stage, our rollout process integrates the collaboration of an LLM with a sandbox-style environment, enabling the production of hybrid content that combines text, Lean 4 code snippets enclosed in dedicated code tags, and real-time environment feedback enclosed in feedback tags. Concretely, generation starts with text-based reasoning, after which the model proposes its first Lean 4 solution to the problem. When a sketch termination trigger is detected, generation pauses and the Lean 4 solution is parsed and sent to the sandbox environment for execution. Upon completion, the sandbox's output (successful results or error messages) is wrapped in feedback tags and returned to the model, which continues generating the rollout until it either provides a final answer or produces a new Lean 4 solution, ultimately yielding a hybrid reasoning trajectory.
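The loop below is a minimal sketch of this rollout procedure. The delimiter strings and the helpers `generate_until`, `extract_last_lean_snippet`, and `run_lean_repl` are hypothetical placeholders, not the actual tokens or infrastructure used in training:

```python
# Minimal sketch of the tool-integrated rollout loop (placeholder names throughout).

SKETCH_END = "</sketch>"          # hypothetical trigger that closes a Lean 4 snippet
FEEDBACK_TAG = "<REPL>{}</REPL>"  # hypothetical wrapper for sandbox feedback
FINAL_MARKER = "<final_answer>"   # hypothetical final-answer delimiter

def tool_integrated_rollout(model, prompt, max_rounds=16):
    """Alternate between model generation and Lean 4 sandbox execution."""
    trajectory = prompt
    for _ in range(max_rounds):
        # Generate until the model either declares a final answer or closes
        # a Lean 4 sketch (the returned chunk includes the stop string).
        chunk = generate_until(model, trajectory, stop=[SKETCH_END, FINAL_MARKER])
        trajectory += chunk
        if FINAL_MARKER in chunk:
            break  # the model believes the proof is complete
        # Run the most recent Lean 4 snippet in the sandbox and feed the
        # output (success message or errors) back into the context.
        lean_code = extract_last_lean_snippet(trajectory)
        feedback = run_lean_repl(lean_code)
        trajectory += FEEDBACK_TAG.format(feedback)
    return trajectory
```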
When the model believes it has reached a final answer, it is required to generate a special delimiter and place its final Lean 4 solution after this marker, enabling reliable rule-based verification. We evaluate the correctness of the code segment following this delimiter to assign the trajectory reward: if the proof passes REPL verification, the reward is 1; otherwise, it is 0. We then perform reinforcement learning from the instruction-tuned model using GRPO, based on this tool-integrated rollout and reward setup.
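The reward rule itself is simple enough to express directly; the sketch below assumes the same hypothetical final-answer marker as above and a `verify_with_repl` helper that wraps Lean 4 REPL verification:

```python
FINAL_MARKER = "<final_answer>"  # placeholder for the special delimiter

def trajectory_reward(trajectory: str) -> float:
    """Binary trajectory reward: 1 if the final Lean 4 code passes REPL verification."""
    if FINAL_MARKER not in trajectory:
        return 0.0                        # no final answer was produced
    final_code = trajectory.split(FINAL_MARKER)[-1]
    ok = verify_with_repl(final_code)     # hypothetical wrapper around the Lean 4 REPL
    return 1.0 if ok else 0.0
```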

Figure 3: Training reward and test accuracy (pass@1) of the tool-integrated GRPO.
Upon convergence of the tool-integrated reinforcement learning, we collect correct trajectories from problems scoring below 0.5 accuracy during training. These trajectories are then filtered to retain only those demonstrating comprehensive REPL feedback analysis. The selected trajectories subsequently undergo another round of supervised fine-tuning. Finally, we implement a secondary reinforcement learning phase to further enhance the model’s tool-integrated reasoning capabilities.
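As a rough sketch of the selection step, assume each problem records its accuracy during RL training and each correct trajectory carries a flag indicating whether it demonstrates REPL feedback analysis (how that flag is computed is left abstract, and all field names below are illustrative):

```python
def select_for_second_sft(problems, accuracy_threshold=0.5):
    """Keep correct trajectories of hard problems that analyse REPL feedback."""
    selected = []
    for problem in problems:
        if problem["train_accuracy"] >= accuracy_threshold:
            continue  # keep only problems the model struggled with during RL
        for traj in problem["correct_trajectories"]:
            if traj["analyzes_repl_feedback"]:
                selected.append(traj)
    return selected
```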
We evaluate our models on Numina’s corrected miniF2F benchmark from their HuggingFace repository. StepFun-Prover-Preview-32B achieves pass@1, surpassing prior SOTA results with a smaller model size, as shown in Table 1. This demonstrates the effectiveness of our tool-integrated reasoning approach in low-pass settings.
| Model | MiniF2F-test (Pass@1) |
|---|---|
| DeepSeek-Prover-V2-7B | |
| DeepSeek-Prover-V2-671B | |
| Kimina-Prover-8B | |
| Kimina-Prover-72B | |
| StepFun-Prover-Preview-7B | |
| StepFun-Prover-Preview-32B | |
Table 1: Performance comparison of theorem-proving models on the miniF2F-test dataset under minimal sampling budgets. For the StepFun-Prover models, we generate 32 responses per query to estimate Pass@1.
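For reference, if pass@1 is estimated with the standard unbiased pass@k estimator (an assumption on our part; for k = 1 it reduces to the fraction of verified samples), the computation from n = 32 samples looks like this, with the counts in the example invented:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with invented counts: 24 of 32 responses verify -> pass@1 = 24/32 = 0.75.
print(pass_at_k(n=32, c=24, k=1))
```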
Tool-integrated reasoning models scale performance with test-time compute. As shown in Table 2, StepFun-Prover-Preview-32B shows consistent performance improvements on the miniF2F-test benchmark when we extend the maximum generation length.
| Maximum generation length | MiniF2F-test (Pass@1) |
|---|---|
| 4096 | |
| 8192 | |
| 12288 | |
| 16384 | |
| 20480 |
Table 2: Performance of StepFun-Prover-Preview-32B on miniF2F-test with various maximum generation lengths. Longer responses yield better results.
As shown in Figure 4, the distribution of REPL interactions for successful proofs on the miniF2F-test benchmark (using StepFun-Prover-Preview-32B with 32 responses per problem) reveals StepFun-Prover-Preview's robust ability to solve problems through extended interaction sequences during the reasoning process, demonstrating that correct solutions can emerge after multiple rounds of refinement.

Figure 4: REPL interaction frequency distribution.
Through our tool-integrated reinforcement learning framework with cold start, StepFun-Prover-Preview develops several advanced reasoning capabilities that leverage real-time environment feedback. Here are some StepFun-Prover outputs demonstrating its diverse reasoning patterns.