Step3 is our cutting-edge multimodal reasoning model—built on a Mixture-of-Experts architecture with 321B total parameters and 38B active. It is designed end-to-end to minimize decoding costs while delivering top-tier performance in vision–language reasoning. Through the co-design of Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD), Step3 maintains exceptional efficiency across both flagship and low-end accelerators.
And now, it’s within your reach. Today, we are releasing:
Step3: enabling grounded multimodal reasoning with accurate visual interpretation and reduced hallucination
Let’s push the boundaries—your innovation starts here.
During pretraining, Step3 processes over 20T text tokens spanning more than ten languages, of which 3.7T high-quality tokens are reserved for the annealing phases; an additional 4T image-text mixed tokens fuel multimodal training.
Delivering data at this scale requires sophisticated interplay between data engineering and data science. The raw data is sourced from a diverse array of web content and collaboratively licensed publisher material, then efficiently extracted and normalized using our in-house document parsers.
To address noise and redundancy, each document passes through a multi-stage understanding framework. A suite of over 10 specialized NLP models evaluates content quality (e.g., defects, toxicity, informational value), identifies high-level subjects (e.g., STEM, literature, code, math), and classifies documents into more than 50 fine-grained domains. We then apply MinHash-based deduplication to eliminate near-duplicates, followed by domain-aware down-sampling to ensure the annealing dataset remains both balanced and diverse. In total, this pipeline has processed over 100B source documents using in-house distributed CPU/GPU clusters.
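To make the near-duplicate step concrete, below is a minimal, self-contained sketch of MinHash-based deduplication. The shingle size, signature length, and similarity threshold are illustrative placeholders; the production pipeline is distributed and considerably more elaborate.

```python
import hashlib
from itertools import combinations

NUM_HASHES = 128       # MinHash signature length (illustrative)
SHINGLE_SIZE = 5       # word n-gram size (illustrative)
JACCARD_THRESHOLD = 0.8

def shingles(text: str, n: int = SHINGLE_SIZE) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text: str) -> list[int]:
    # One hash family simulated by salting a stable hash with the slot index.
    sig = []
    for seed in range(NUM_HASHES):
        slot_min = min(
            int.from_bytes(hashlib.blake2b(f"{seed}|{s}".encode(), digest_size=8).digest(), "big")
            for s in shingles(text)
        )
        sig.append(slot_min)
    return sig

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

docs = {
    "d1": "the quick brown fox jumps over the lazy dog near the river bank",
    "d2": "the quick brown fox jumps over the lazy dog near the river",
    "d3": "completely unrelated text about mixture of experts training",
}
sigs = {k: minhash_signature(v) for k, v in docs.items()}
for a, b in combinations(docs, 2):
    sim = estimated_jaccard(sigs[a], sigs[b])
    if sim >= JACCARD_THRESHOLD:
        print(f"near-duplicate: {a} ~ {b} (est. Jaccard {sim:.2f})")
```

In practice the signatures are bucketed with locality-sensitive hashing so that only candidate pairs, not all pairs, are compared.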
Beyond that, we conduct large-scale case studies and extensive ablation experiments to refine filtering thresholds and sampling strategies, ultimately shaping the final training “data recipe”. This rigorous curation results in a high-quality, diverse dataset that not only supports robust model performance but also lays a strong foundation for our multimodal data corpus, as detailed in later sections.
In designing the model architecture, we prioritize optimizing decoding: decoding dominates the cost of serving long-context reasoning workloads, and it is where LLMs suffer the lowest hardware efficiency, as discussed further below.
The development of Step3’s architecture followed a model-system co-design approach, in which algorithmic innovations—particularly in attention mechanisms and Mixture-of-Experts (MoE) architectures—were closely aligned with hardware characteristics and deployment constraints.
MFA: At the heart of Step3's efficiency lies our novel Multi-Matrix Factorization Attention (MFA) [2]. MFA reduces KV-cache demands and attention FLOPs, using just 22% of DeepSeek-V3's per-token attention compute, thereby making advanced inference significantly more affordable.
| | Step3 | DeepSeek-V3 | Qwen3-235B | ERNIE 4.5 | Qwen3-32B |
|---|---|---|---|---|---|
| model_dim | 7168 | 7168 | 4096 | 8192 | 5120 |
| Dense ffn_dim | 18432 | 18432 | 12288 | 28672 | 25600 |
| Layer num (MoE layers) | 61 (56) | 61 (58) | 94 (94) | 54 (51) | 64 |
| query_head_num | 64 | 128 | 64 | 64 | 64 |
| head_size | 256 | 128 | 128 | 128 | 128 |
| Attention class | MFA | MLA | GQA-4 | GQA-8 | GQA-8 |
| Expert num / top-k | 3 in 48 | 8 in 256 | 8 in 128 | 8 in 64 | – |
| Dynamic expert dim (shared expert) | 5120 (5120) | 2048 (2048) | 1536 (0) | 3584 (0) | – |
| Activated params | 38B | 37B | 22B | 47B | 32B |
| Total params (LLM only) | 316B | 671B | 235B | 300B | 32B |
| KV cache size (length = 32K) | 1.02E+09 | 1.15E+09 | 3.15E+09 | 3.60E+09 | 4.30E+09 |
| Attention computation w/o linear (FLOPs) | 1.31E+11 | 5.89E+11 | 1.01E+11 | 5.80E+10 | 6.87E+10 |
| Arithmetic intensity | 128 | 512 | 32 | 16 | 16 |
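The last three rows can be reproduced from the architecture hyper-parameters. Below is a minimal sketch under our assumptions about the accounting: the KV cache is counted in elements, FLOPs as two multiply-accumulates over both the QK^T and PV matmuls, and arithmetic intensity as attention FLOPs divided by KV-cache elements read (the report may use slightly different conventions).

```python
def attention_cost(n_layers, n_q_heads, head_dim, kv_dim_per_layer, ctx=32 * 1024):
    """Rough per-sequence decoding cost of the attention core (no linear projections).

    kv_dim_per_layer: elements cached per token per layer (K plus V).
    """
    kv_cache = kv_dim_per_layer * n_layers * ctx        # elements
    flops = 4 * n_q_heads * head_dim * ctx * n_layers   # 2 matmuls x 2 FLOPs per MAC
    return kv_cache, flops, flops / kv_cache            # last value = arithmetic intensity

# Step3 (MFA): a single shared 256-dim K and 256-dim V per layer (our reading of the table).
print(attention_cost(n_layers=61, n_q_heads=64, head_dim=256, kv_dim_per_layer=512))
# -> roughly (1.02e9, 1.31e11, 128), matching the table.

# Qwen3-235B (GQA-4): 4 KV heads of 128 dims each for K and V.
print(attention_cost(n_layers=94, n_q_heads=64, head_dim=128, kv_dim_per_layer=4 * 128 * 2))
# -> roughly (3.15e9, 1.01e11, 32)
```

The key takeaway is the intensity column: MFA keeps attention arithmetic intensity near the roofline of modern accelerators, whereas GQA variants are far more memory-bound during decoding.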
Step3 packs 321B parameters yet still runs on eight 48 GB GPUs, processing contexts of up to 800K tokens (batch × length)*. This scale strikes a deliberate balance between performance and engineering cost and draws on state-of-the-art training recipes [3, 4].
* This total token capacity is calculated assuming int8 quantization for non-attention parameters, while other components, including the KV cache, remain in fp16 or bf16 precision.
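As a rough sanity check on the 800K-token figure, here is a back-of-envelope estimate. For simplicity it treats all weights as int8; keeping the attention parameters in bf16 and reserving runtime buffers (as the footnote implies) pulls the upper bound down toward the quoted 800K.

```python
GB = 1e9

gpu_mem      = 8 * 48 * GB            # eight 48 GB accelerators
weights_int8 = 321e9                  # ~1 byte/param if every weight were int8
kv_per_token = 512 * 61 * 2           # MFA: 512 cached elements/layer x 61 layers x 2 B (bf16)

headroom = gpu_mem - weights_int8     # memory left for the KV cache, before other overheads
print(f"KV cache per token : ~{kv_per_token / 1024:.0f} KiB")
print(f"Token headroom     : ~{headroom / kv_per_token / 1e3:.0f}K (upper bound)")
```

The tiny per-token KV footprint of MFA is what makes a near-million-token aggregate cache feasible on a single eight-GPU node.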
16× spatial down-sampling lies at the core of Step3's multimodal pathway. Built on EVA-CLIP 5B [5], our 5B-parameter vision encoder extracts dense image features, which are then compressed through two successive 2D convolutional layers, reducing the token grid to one-sixteenth of its original size. The resulting visual tokens merge seamlessly with text tokens before entering the Large Language Model (LLM), delivering a powerful yet compute-efficient representation that underpins Step3's multimodal capabilities.
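A minimal sketch of this token-compression step, assuming two stride-2 convolutions with an activation in between; the channel widths and the activation are illustrative, not the production connector.

```python
import torch
import torch.nn as nn

class VisualTokenCompressor(nn.Module):
    """Compresses the vision-encoder feature grid by 4x per spatial dim, i.e. 16x fewer tokens."""

    def __init__(self, in_dim: int = 1792, out_dim: int = 2048):  # widths are illustrative
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_dim, out_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(out_dim, out_dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, channels, H, W) dense features from the vision encoder
        x = self.conv(feats)                    # (batch, out_dim, H/4, W/4)
        return x.flatten(2).transpose(1, 2)     # (batch, H*W/16, out_dim) visual tokens

grid = torch.randn(1, 1792, 32, 32)             # 32x32 = 1024 patch features
tokens = VisualTokenCompressor()(grid)
print(tokens.shape)                             # torch.Size([1, 64, 2048]) -> 16x fewer tokens
```

The compressed tokens are then mapped into the LLM embedding space by the connector before being interleaved with text tokens.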
Multimodal training is divided into two stages. The first stage centers on training the vision encoder: we jointly optimize it with a compact LLM via next-token prediction on 3.5 trillion tokens from the Paired dataset and a subset of the Multi-Task data. In the second stage, the vision encoder is frozen, and the Connector and the LLM are trained on the full dataset, totaling 1.4T tokens.
Our multimodal dataset comprises 4T unique tokens, primarily consisting of Paired, Interleaved, and Multi-Task data. The Paired data consists of open-source datasets, web-scraped image-text pairs, and specialized domain-specific pairs, all of which undergo a rigorous cleaning process involving similarity filtering, rebalancing, and deduplication. Interleaved data from webpages, papers, books, and tutorials is cleaned based on metrics such as information density and image-text relevance. The Multi-Task data includes OCR, Table, Grounding, GUI, Video, VQA, exam questions, and reasoning data, with a substantial synthetic component. Overall, Chinese and English account for roughly equal shares of about 95% of the corpus, while the remaining ~5% covers other languages.
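As an illustration of the Paired-data cleaning loop, here is a toy sketch of similarity filtering, deduplication, and rebalancing. The similarity scorer, thresholds, and per-domain cap are stand-ins, not our production components.

```python
import hashlib
from collections import defaultdict

SIM_THRESHOLD = 0.28     # stand-in for an image-text similarity cutoff
MAX_PER_DOMAIN = 2       # toy rebalancing cap per domain

def image_fingerprint(image_bytes: bytes) -> str:
    # Exact-duplicate check; production dedup uses perceptual hashing / embeddings.
    return hashlib.sha256(image_bytes).hexdigest()

def clean_pairs(pairs, similarity_fn):
    """pairs: iterable of dicts with keys image (bytes), caption (str), domain (str)."""
    seen, per_domain, kept = set(), defaultdict(int), []
    for p in pairs:
        if similarity_fn(p["image"], p["caption"]) < SIM_THRESHOLD:
            continue                                   # similarity filtering
        fp = image_fingerprint(p["image"])
        if fp in seen:
            continue                                   # deduplication
        if per_domain[p["domain"]] >= MAX_PER_DOMAIN:
            continue                                   # rebalancing
        seen.add(fp)
        per_domain[p["domain"]] += 1
        kept.append(p)
    return kept

# toy run with a dummy similarity scorer
dummy = lambda img, txt: 0.5 if txt else 0.0
data = [{"image": b"img_a", "caption": "a cat on a mat", "domain": "web"},
        {"image": b"img_a", "caption": "duplicate of the cat image", "domain": "web"},
        {"image": b"img_b", "caption": "", "domain": "web"}]
print(len(clean_pairs(data, dummy)))   # 1
```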
Our alignment pipeline unfolds in two phases. We begin with supervised fine-tuning on a curated set of conversations spanning multimodal mathematical reasoning, competitive programming, diverse STEM topics, and general non-reasoning tasks. Next, we perform reinforcement learning (RL): structured problems receive dense, automatically verified rewards, while open-ended tasks rely on preference-model or human feedback. A dedicated value network supplies reliable advantage estimates, keeping policy updates stable throughout training.
During supervised fine-tuning, every conversation must parse cleanly, sit within a healthy perplexity range, and be free of near-duplicates, excessive token repetition, URL or image clutter, and high n-gram overlap. A lightweight quality scorer, integrating signals of toxicity, factuality, and length, serves as the final filter before samples are admitted into the fine-tuning dataset; a simplified version of this filter is sketched below. Moreover, we have significantly enhanced our model's agentic capabilities. Specifically, we employ a reverse synthesis process to generate complex queries involving reasoning and tool use, followed by refinement using a Directed Acyclic Graph (DAG), rejection sampling, and difficulty-based filtering.
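A simplified version of that admission filter might look like the following; the thresholds, perplexity model, and quality scorer are illustrative placeholders rather than the values we actually use.

```python
import json
import re

# Illustrative thresholds; the real pipeline tunes these per domain.
PPL_RANGE = (2.0, 80.0)
MAX_NGRAM_OVERLAP = 0.6
MAX_URL_COUNT = 3

def ngram_overlap(text: str, n: int = 8) -> float:
    words = text.split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return (1.0 - len(set(grams)) / len(grams)) if grams else 0.0

def admit(sample: str, perplexity_fn, quality_fn, seen_hashes: set) -> bool:
    try:
        conv = json.loads(sample)                      # must parse cleanly
    except json.JSONDecodeError:
        return False
    text = " ".join(turn["content"] for turn in conv)
    if not PPL_RANGE[0] <= perplexity_fn(text) <= PPL_RANGE[1]:
        return False                                   # healthy perplexity range
    if ngram_overlap(text) > MAX_NGRAM_OVERLAP:
        return False                                   # repetition / n-gram overlap
    if len(re.findall(r"https?://\S+", text)) > MAX_URL_COUNT:
        return False                                   # URL clutter
    h = hash(text)
    if h in seen_hashes:
        return False                                   # duplicates (exact hash stand-in)
    seen_hashes.add(h)
    return quality_fn(text) > 0.5                      # toxicity/factuality/length scorer

demo = json.dumps([{"role": "user", "content": "Prove that sqrt(2) is irrational."},
                   {"role": "assistant", "content": "Assume sqrt(2) = p/q in lowest terms ..."}])
print(admit(demo, perplexity_fn=lambda t: 12.0, quality_fn=lambda t: 0.9, seen_hashes=set()))  # True
```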
The verifiable prompts feed into the RL phase. Here, an internal multimodal reasoning model predicts solution steps and assigns difficulty labels, ensuring that training includes a balanced mix of easy, medium, and hard cases. These problems span mathematics, programming, logic, and complex problem-solving, ranging from elementary education to frontier developments. They also include specialized tasks in multimodal perception and understanding, specifically designed to support agentic interaction.
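The difficulty labels then drive how prompts are drawn into each RL batch. A toy version of such balanced sampling is shown below; the bucket names and per-level ratio are assumptions, not Step3's actual recipe.

```python
import random
from collections import defaultdict

def balanced_batch(problems, batch_size=12, mix=("easy", "medium", "hard")):
    """problems: list of dicts with a 'difficulty' field assigned by the grading model."""
    buckets = defaultdict(list)
    for p in problems:
        buckets[p["difficulty"]].append(p)
    per_level = batch_size // len(mix)
    batch = []
    for level in mix:
        batch += random.sample(buckets[level], k=min(per_level, len(buckets[level])))
    random.shuffle(batch)
    return batch

pool = ([{"difficulty": "easy", "id": i} for i in range(50)]
        + [{"difficulty": "medium", "id": i} for i in range(50)]
        + [{"difficulty": "hard", "id": i} for i in range(50)])
print([p["difficulty"] for p in balanced_batch(pool)])   # an even mix of the three levels
```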
LLMs face low hardware efficiency during decoding, especially for long-context reasoning tasks. Step3 employs a hardware-aware model-system co-design approach, tailoring its architecture to minimize decoding costs. Step3 innovates in two key dimensions: (1) A novel Multi-Matrix Factorization Attention (MFA) mechanism that significantly reduces both KV cache size and computation while maintaining high attention expressiveness, and (2) Attention-FFN Disaggregation (AFD), a distributed inference system that decouples attention and Feed-Forward Network (FFN) layers into specialized subsystems. This co-design achieves unprecedented cost efficiency: Step3 significantly reduces theoretical decoding costs compared with models like DeepSeek-V3 and Qwen3 MoE 235B, with the gains widening at longer context lengths.
Step3 achieves this low cost while activating 38B parameters per token (more than DeepSeek-V3 and Qwen3 MoE 235B), demonstrating that hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD are critical to cost-effectiveness. We perform a head-to-head comparison with DeepSeek-V3 in its favorable scenarios. Our implementation on Hopper GPUs achieves a decoding throughput of up to 4,039 tokens per second per GPU under a 50 ms TPOT SLA (4K context, FP8, no MTP), higher than DeepSeek-V3's 2,324 in the same setup, setting a new Pareto frontier for LLM decoding.
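To unpack these numbers: a 50 ms TPOT SLA caps each request at 20 output tokens per second, so the quoted per-GPU throughput implies how many requests each GPU must serve concurrently. The sketch below is a rough inversion under that assumption; actual batching in the AFD system is more involved.

```python
TPOT_SLA = 0.050                   # seconds per output token per request (50 ms)
step3_tgs, dsv3_tgs = 4039, 2324   # tokens / GPU / second at 4K context, FP8, no MTP

per_request_rate = 1 / TPOT_SLA    # 20 tokens/s delivered to every request
print(f"Concurrent requests per GPU (Step3): ~{step3_tgs / per_request_rate:.0f}")
print(f"Concurrent requests per GPU (DSv3) : ~{dsv3_tgs / per_request_rate:.0f}")
print(f"Throughput gain over DSv3          : {step3_tgs / dsv3_tgs - 1:.0%}")
```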
The Pareto frontier of recent models in terms of activated parameters and decoding cost. The darker area is the Pareto frontier of GQA models.
Having introduced MFA above, we now turn to Attention-FFN Disaggregation (AFD). To implement AFD efficiently, our system leverages two key techniques: a multi-stage pipeline across the disaggregated attention and FFN instances (illustrated below), and StepMesh, a GPU-direct RDMA communication library built for AFD.
Multi-stage pipeline of the Step3 AFD system
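Conceptually, the attention instances and FFN instances form two pools that exchange activations at every layer, with multiple micro-batches kept in flight so that neither pool idles while the other computes or while tensors are on the wire. The sketch below is a minimal illustration of that ping-pong schedule, using Python queues in place of StepMesh's GPU-direct RDMA transfers; the worker counts and payloads are illustrative, not Step3's actual scheduler.

```python
import queue
import threading

NUM_LAYERS, NUM_MICROBATCHES = 4, 3              # illustrative sizes

to_ffn, to_attn = queue.Queue(), queue.Queue()   # stand-ins for StepMesh RDMA channels

def attention_pool():
    """Holds the KV cache; runs attention + its linears, then ships activations to the FFN pool."""
    for layer in range(NUM_LAYERS):
        for mb in range(NUM_MICROBATCHES):
            hidden = f"attn(layer={layer}, microbatch={mb})"   # placeholder for real compute
            to_ffn.put((layer, mb, hidden))
        # In the real pipeline this pool already computes attention for other in-flight
        # micro-batches while waiting; that overlap is elided here for clarity.
        for _ in range(NUM_MICROBATCHES):
            _layer, _mb, _hidden = to_attn.get()               # activations return for layer+1
    to_ffn.put(None)                                           # signal end of the decode step

def ffn_pool():
    """Holds the MoE experts; applies the FFN and returns activations to the attention pool."""
    while True:
        item = to_ffn.get()
        if item is None:
            return
        layer, mb, hidden = item
        to_attn.put((layer, mb, hidden + " -> moe_ffn"))

a = threading.Thread(target=attention_pool)
f = threading.Thread(target=ffn_pool)
a.start(); f.start(); a.join(); f.join()
print("one decode step completed across", NUM_LAYERS, "layers")
```

Disaggregating the two pools lets each be provisioned and quantized independently, which is why the deployments below mix different counts of attention and FFN instances.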
StepMesh: GPU-Direct RDMA Library for Attention-FFN Disaggregation
Performance Results: Our results are promising, showing Step3's significant gains in tokens/GPU/second over DeepSeek-V3. At a context length of 4096, we achieve a peak TGS (tokens/GPU/second) of about 4,000, around 70% higher than DSv3's reported number, with only about one-third of the Hopper GPUs. Note that this TGS is achieved under a strict 50 ms per-token decode budget.
| Model | Context Len (avg) | # Hopper GPUs | Peak TGS (tokens/GPU/s) |
|---|---|---|---|
| DSv3-blog [6] | 4989 | 144 | 1850 |
| DSv3-profile [7] | 4096 | 128 | 2324 |
| Step3 (bf16 attn) | 4096 | 40 (3A2F) | 3321 |
| Step3 (fp8 attn) | 4096 | 32 (2A2F) | 4039 |
| Step3 (fp8 attn) | 8192 | 48 (4A2F) | 2643 |
Performance comparison with DSv3's reported numbers under a 20 tokens/s decoding SLA. TGS: tokens/GPU/s.
In the process of scaling Step3’s MoE architecture, we observed a new failure mode we have termed the “dead expert” phenomenon. In this case, certain dynamic experts become effectively inactive—not due to routing imbalance, but because their output weight norms vanish during training. This leads to negligible contribution during the model’s forward pass, even though tokens are still being routed to them, which is distinct from the more commonly discussed “router collapse”. The root causes behind this phenomenon are still being actively investigated. We’ll share further insights as our research progresses.
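One way to monitor for this during training is simply to track the output-projection weight norm of every dynamic expert and flag those that collapse toward zero. The sketch below does this against a generic MoE expert layout; the module and parameter names are hypothetical, not Step3's actual implementation.

```python
import torch
import torch.nn as nn

class Expert(nn.Module):
    """Stand-in for a dynamic expert MLP (the real layout in Step3 may differ)."""
    def __init__(self, dim: int = 64, hidden: int = 256):
        super().__init__()
        self.up_proj = nn.Linear(dim, hidden)
        self.down_proj = nn.Linear(hidden, dim)    # output projection whose norm we monitor

def find_dead_experts(experts: nn.ModuleList, rel_threshold: float = 1e-3):
    """Flags experts whose output-weight norm has collapsed relative to the median expert."""
    norms = torch.tensor([e.down_proj.weight.norm().item() for e in experts])
    cutoff = norms.median() * rel_threshold
    return [i for i, n in enumerate(norms.tolist()) if n < cutoff]

experts = nn.ModuleList(Expert() for _ in range(8))
with torch.no_grad():
    experts[3].down_proj.weight.mul_(1e-6)         # simulate an expert whose output norm vanished

print("dead experts:", find_dead_experts(experts)) # [3]
```

Unlike router-collapse diagnostics, which look at routing statistics, this check looks only at the weights, matching the failure mode described above: tokens still reach the expert, but its output contributes almost nothing.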
The current version of Step3, while powerful in many respects, is under-optimized for vibe coding. Additionally, prolonged multimodal reasoning training reveals a clear trade-off: as the model's textual reasoning ability improves, its visual-perception accuracy deteriorates. We are actively working to mitigate these limitations.
[1] https://github.com/stepfun-ai/Step3/blob/main/Step3-Sys-Tech-Report.pdf
[2] Jingcheng Hu, Houyi Li, Yinmin Zhang, Zili Wang, Shuigeng Zhou, Xiangyu Zhang, and Heung-Yeung Shum. 2025. Multi-matrix Factorization Attention. In Findings of the Association for Computational Linguistics: ACL 2025, pages 25114–25126, Vienna, Austria. Association for Computational Linguistics.
[3] Houyi Li, Wenzhen Zheng, Qiufeng Wang, Zhenyu Ding, Haoying Wang, Zili Wang, et al. 2025. Predictable Scale: Part II, Farseer: A Refined Scaling Law in Large Language Models. arXiv preprint arXiv:2506.10972.
[4] Houyi Li, Wenzhen Zheng, Qiufeng Wang, Hanshan Zhang, Zili Wang, et al. 2025. Predictable Scale: Part I, Step Law — Optimal Hyperparameter Scaling Law in Large Language Model Pretraining. arXiv preprint arXiv:2503.04715.
[5] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. 2023. EVA-CLIP: Improved Training Techniques for CLIP at Scale. arXiv preprint arXiv:2303.15389.
[6] https://github.com/deepseek-ai/open-infra-index/tree/main/202502OpenSourceWeek
[7] https://github.com/deepseek-ai/profile-data/