Step3: Cost-Effective Multimodal Intelligence

Introduction

Step3 is our cutting-edge multimodal reasoning model—built on a Mixture-of-Experts architecture with 321B total parameters and 38B active. It is designed end-to-end to minimize decoding costs while delivering top-tier performance in vision–language reasoning. Through the co-design of Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD), Step3 maintains exceptional efficiency across both flagship and low-end accelerators.

And now, it’s within your reach. Today, we are releasing:

Step3— enabling grounded multimodal reasoning with accurate visual interpretation and reduced hallucination

Let’s push the boundaries—your innovation starts here.

Overall Performance

Pretrain

Pretrain Data

During pretraining, Step3 processes over 20T text tokens spanning more than ten languages, of which 3.7T high-quality tokens are reserved for the annealing phases; an additional 4T image-text mixed tokens fuel multimodal training.

Delivering data at this scale requires sophisticated interplay between data engineering and data science. The raw data is sourced from a diverse array of web content and collaboratively licensed publisher material, then efficiently extracted and normalized using our in-house document parsers.

To address noise and redundancy, each document passes through a multi-stage understanding framework. A suite of over 10 specialized NLP models evaluates content quality (e.g., defects, toxicity, informational value), identifies high-level subjects (e.g., STEM, literature, code, math), and classifies documents into more than 50 fine-grained domains. We then apply MinHash-based deduplication to eliminate near-duplicates, followed by domain-aware down-sampling to ensure the annealing dataset remains both balanced and diverse. In total, this pipeline has processed over 100B source documents using in-house distributed CPU/GPU clusters.
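
To make the deduplication step concrete, here is a minimal, self-contained sketch of MinHash-based near-duplicate detection. It is an illustration only: the shingle size, signature length, and similarity threshold are placeholder choices, not our production settings.

```python
# Illustrative MinHash near-duplicate detection, not the production pipeline.
import hashlib
from itertools import combinations

NUM_PERM = 128        # number of salted hash functions in the signature (assumed)
SHINGLE_SIZE = 5      # word n-gram size used for shingling (assumed)
DUP_THRESHOLD = 0.8   # estimated Jaccard similarity treated as "near-duplicate" (assumed)

def shingles(text: str, n: int = SHINGLE_SIZE) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(doc_shingles: set[str], num_perm: int = NUM_PERM) -> list[int]:
    # One salted hash per "permutation"; keep the minimum value per salt.
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")
        sig.append(min(
            int.from_bytes(hashlib.blake2b(salt + s.encode(), digest_size=8).digest(), "little")
            for s in doc_shingles
        ))
    return sig

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

docs = {
    "doc0": "the quick brown fox jumps over the lazy dog near the river bank",
    "doc1": "the quick brown fox jumps over the lazy dog near the river bend",
    "doc2": "completely different content about mixture of experts models",
}
signatures = {k: minhash_signature(shingles(v)) for k, v in docs.items()}
for a, b in combinations(docs, 2):
    sim = estimated_jaccard(signatures[a], signatures[b])
    print(a, b, f"~Jaccard={sim:.2f}", "near-duplicate" if sim >= DUP_THRESHOLD else "keep both")
```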

Beyond that, we conduct large-scale case studies and extensive ablation experiments to refine filtering thresholds and sampling strategies, ultimately shaping the final training “data recipe”. This rigorous curation results in a high-quality, diverse dataset that not only supports robust model performance but also lays a strong foundation for our multimodal data corpus, as detailed in later sections.

Pretrain Model Architecture

In designing the model architecture, we prioritize optimizing decoding for several key reasons:

  1. Decoding is the most expensive phase per token due to its low model FLOPs utilization (MFU), especially compared to training and prefill.
  2. Because reasoning models grow more capable with longer chains of thought, making decoding more efficient and less costly lets the same compute budget support deeper reasoning—and, consequently, greater intelligence.
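
To put the first point in perspective, a back-of-the-envelope estimate shows why decoding is memory-bound at small batch sizes. The hardware figures below are rough assumptions for a Hopper-class GPU, and KV-cache traffic is ignored for simplicity (in practice it makes real decode even more memory-bound).

```python
# Rough estimate of why decode MFU is low: each generated token must stream all
# active weights from HBM, so FLOPs per byte moved grow only with batch size.
PEAK_FLOPS = 990e12        # ~bf16 dense peak, FLOP/s (assumed)
HBM_BANDWIDTH = 3.35e12    # ~HBM bandwidth, bytes/s (assumed)
ACTIVE_PARAMS = 38e9       # Step3 active parameters per token
BYTES_PER_PARAM = 2        # bf16 weights

for batch in (1, 32, 256):
    flops = 2 * ACTIVE_PARAMS * batch              # ~2 FLOPs per weight per token
    bytes_moved = ACTIVE_PARAMS * BYTES_PER_PARAM  # weights are read once per decode step
    arithmetic_intensity = flops / bytes_moved
    t_compute = flops / PEAK_FLOPS
    t_memory = bytes_moved / HBM_BANDWIDTH
    mfu = t_compute / max(t_compute, t_memory)     # upper bound: limited by the slower of the two
    print(f"batch={batch:4d}  intensity={arithmetic_intensity:6.1f} FLOP/B  MFU<={mfu:.1%}")
```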

The development of Step3’s architecture followed a model-system co-design approach, in which algorithmic innovations—particularly in attention mechanisms and Mixture-of-Experts (MoE) architectures—were closely aligned with hardware characteristics and deployment constraints.

MFA: At the heart of Step3's efficiency lies our novel Multi-Matrix Factorization Attention (MFA) [2]. MFA reduces KV-cache demands and attention FLOPs, using just 22% of DeepSeek-V3's per-token attention cost, thereby making advanced inference significantly more affordable.

|  | Step3 | DeepSeek-V3 | Qwen3-235B | ERNIE4.5 | Qwen3 32B |
| --- | --- | --- | --- | --- | --- |
| model_dim | 7168 | 7168 | 4096 | 8192 | 5120 |
| Dense ffn_dim | 18432 | 18432 | 12288 | 28672 | 25600 |
| Layer num (MoE layers) | 61 (56) | 61 (58) | 94 (94) | 54 (51) | 64 |
| query_head_num | 64 | 128 | 64 | 64 | 64 |
| Head_size | 256 | 128 | 128 | 128 | 128 |
| Attention Class | MFA | MLA | GQA-4 | GQA-8 | GQA-8 |
| Expert Num (TopK) | 3 in 48 | 8 in 256 | 8 in 128 | 8 in 64 | - |
| Dynamic Expert Dim (Shared Expert) | 5120 (5120) | 2048 (2048) | 1536 (0) | 3584 (0) | - |
| Activated Params | 38B | 37B | 22B | 47B | 32B |
| Total Params (LLM only) | 316B | 671B | 235B | 300B | 32B |
| KV Cache Size (length = 32k) | 1.02E+09 | 1.15E+09 | 3.15E+09 | 3.60E+09 | 4.30E+09 |
| Attention Computation w/o Linear (FLOPs) | 1.31E+11 | 5.89E+11 | 1.01E+11 | 5.80E+10 | 6.87E+10 |
| Arithmetic Intensity | 128 | 512 | 32 | 16 | 16 |
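
The last two rows follow, to a good approximation, from the attention configurations above. The short sketch below reproduces them from per-token, per-layer KV sizes; the MFA and MLA cache layouts used here are inferred assumptions rather than published configuration.

```python
# Rough reproduction of the "KV Cache Size" and "Arithmetic Intensity" rows.
# Per-token, per-layer cached elements (assumptions inferred from the table):
#   MFA   : one shared 256-d key and one shared 256-d value      -> 512
#   MLA   : 512-d compressed KV latent + 64-d decoupled RoPE key -> 576
#   GQA-k : 2 * k KV heads of head_size 128                      -> 256 * k
SEQ_LEN = 32 * 1024

models = {
    # name               (layers, kv elems/layer, attn FLOPs/token from the table)
    "Step3 (MFA)":        (61, 2 * 256,     1.31e11),
    "DeepSeek-V3 (MLA)":  (61, 512 + 64,    5.89e11),
    "Qwen3-235B (GQA-4)": (94, 2 * 4 * 128, 1.01e11),
    "ERNIE4.5 (GQA-8)":   (54, 2 * 8 * 128, 5.80e10),
    "Qwen3 32B (GQA-8)":  (64, 2 * 8 * 128, 6.87e10),
}

for name, (layers, kv_per_layer, attn_flops) in models.items():
    kv_cache = layers * kv_per_layer * SEQ_LEN   # cached elements at 32k context
    intensity = attn_flops / kv_cache            # FLOPs per cached element
    print(f"{name:22s} KV≈{kv_cache:.2e}  intensity≈{intensity:.0f}")
```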

Step3 packs 321B parameters yet still runs on eight 48 GB GPUs, processing contexts up to 800K tokens (batch × length)* . This scale strikes a deliberate balance between performance and engineering cost and draws on state-of-the-art training recipes [3, 4].

* This total token budget is calculated assuming int8 quantization for non-attention parameters, while other components, including the KV cache, remain in fp16 or bf16 precision.
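
As a rough sanity check of the 800K-token figure, a back-of-the-envelope memory budget (all numbers approximate, under the quantization assumptions above) looks like this:

```python
# Approximate memory budget for eight 48 GB GPUs serving Step3.
GPU_MEM = 8 * 48e9                 # total device memory, bytes
WEIGHTS = 316e9 * 1                # LLM weights at ~1 byte/param under int8 (approximation)
KV_PER_TOKEN = 2 * 256 * 61 * 2    # MFA: K+V of dim 256, 61 layers, bf16 -> ~62.5 KB/token (assumed layout)

budget_for_kv = GPU_MEM - WEIGHTS  # ignores activations and runtime workspace
print(f"KV per token ≈ {KV_PER_TOKEN / 1e3:.1f} KB")
print(f"Upper bound on cached tokens ≈ {budget_for_kv / KV_PER_TOKEN:,.0f}")
```

The gap between this crude upper bound and the quoted 800K tokens leaves room for the vision encoder, activations, and runtime buffers.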

Multimodal

A 16× token down-sampling lies at the core of Step3's multimodal pathway. Built on EVA-CLIP 5B [5], our 5B-parameter vision encoder extracts dense image features, which are then compressed through two successive 2D convolutional layers, reducing the token grid to one-sixteenth of its original size. The resulting visual tokens merge seamlessly with text tokens before entering the Large Language Model (LLM), delivering a powerful yet compute-efficient representation that underpins Step3's multimodal capabilities.
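
The exact connector configuration is not spelled out here; the following PyTorch sketch only illustrates the idea, with the channel sizes and strides as assumptions (the output width is chosen to match the 7168 model_dim above).

```python
import torch
import torch.nn as nn

class VisionConnector(nn.Module):
    """Two stride-2 convolutions: 4x downsampling per spatial axis = 1/16 of the tokens."""
    def __init__(self, vision_dim: int = 1792, llm_dim: int = 7168):  # dims are assumptions
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(vision_dim, llm_dim, kernel_size=2, stride=2),
            nn.GELU(),
            nn.Conv2d(llm_dim, llm_dim, kernel_size=2, stride=2),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: [batch, vision_dim, H, W] dense features from the vision encoder
        x = self.down(feats)                 # [batch, llm_dim, H/4, W/4]
        return x.flatten(2).transpose(1, 2)  # [batch, (H*W)/16, llm_dim] visual tokens

tokens = VisionConnector()(torch.randn(1, 1792, 24, 24))
print(tokens.shape)  # torch.Size([1, 36, 7168]): 576 patches compressed to 36 tokens
```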

Multimodal training is divided into two stages. The first stage centers on training the vision encoder: we jointly optimize it with a compact LLM via next-token prediction on 3.5 trillion tokens from the Paired dataset and a subset of the Multi-Task data. In the second stage, the vision encoder is frozen, and the Connector and the LLM are trained on the full dataset, totaling 1.4T tokens.

Our multimodal dataset comprises 4T unique tokens, primarily consisting of Paired, Interleaved, and Multi-Task data. The Paired data consists of open-source datasets, web-scraped image-text pairs, and specialized domain-specific pairs, all of which undergo a rigorous cleaning process involving similarity filtering, rebalancing, and deduplication. Interleaved data from webpages, papers, books, and tutorials is cleaned based on metrics such as information density and image-text relevance. The Multi-Task data includes OCR, Table, Grounding, GUI, Video, VQA, exam questions, and reasoning data, with a substantial synthetic component. Overall, Chinese and English account for roughly equal shares of the corpus, with the remaining ~5% covering other languages.

Post-Train

Our alignment pipeline unfolds in two phases. We begin with supervised fine-tuning on curated conversations spanning multimodal mathematical reasoning, competitive programming, diverse STEM topics, and general non-reasoning tasks. Next, we perform reinforcement learning (RL): structured problems receive dense, automatically verified rewards, while open-ended tasks rely on preference-model or human feedback. A dedicated value network supplies reliable advantage estimates, keeping policy updates stable throughout training.

During supervised fine-tuning, every conversation must parse cleanly, sit within a healthy perplexity range, and be free of near-duplicates, excessive token repetition, URL or image clutter, and high n-gram overlap. A lightweight quality scorer, integrating signals of toxicity, factuality, and length, serves as the final filter before samples are admitted into the fine-tuning dataset. Moreover, we have significantly enhanced our model's agentic capabilities. Specifically, we employ a reverse synthesis process to generate complex queries involving reasoning and tool use, followed by refinement using a Directed Acyclic Graph (DAG), rejection sampling, and difficulty-based filtering.
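
Schematically, the filtering logic can be pictured as below; the thresholds and the combined quality score are placeholders, not the values used in our pipeline.

```python
import re

# Hypothetical thresholds; the production values are not published.
PPL_RANGE = (2.0, 80.0)
MAX_REPEAT_RATIO = 0.2   # fraction of repeated 4-grams tolerated
MAX_URL_IMG = 3          # URL / image-tag clutter budget
MIN_QUALITY = 0.5        # combined toxicity/factuality/length score

def repeated_ngram_ratio(tokens: list[str], n: int = 4) -> float:
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def keep_sample(text: str, perplexity: float, quality_score: float,
                is_near_duplicate: bool) -> bool:
    tokens = text.split()
    clutter = len(re.findall(r"https?://\S+|<image>", text))
    return (
        PPL_RANGE[0] <= perplexity <= PPL_RANGE[1]
        and not is_near_duplicate
        and repeated_ngram_ratio(tokens) <= MAX_REPEAT_RATIO
        and clutter <= MAX_URL_IMG
        and quality_score >= MIN_QUALITY
    )

print(keep_sample("Explain chain-of-thought prompting step by step.", 12.3, 0.8, False))  # True
```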

The verifiable prompts feed into the RL phase. Here, an internal multimodal reasoning model predicts solution steps and assigns difficulty labels, ensuring that training includes a balanced mix of easy, medium, and hard cases. These problems span mathematics, programming, logic, and complex problem-solving, ranging from elementary education to frontier developments. They also include specialized tasks in multimodal perception and understanding, specifically designed to support agentic interaction.

Infrastructures

LLMs face low hardware efficiency during decoding, especially for long-context reasoning tasks. Step3 employs a hardware-aware model-system co-design approach, tailoring its architecture to minimize decoding costs. Step3 innovates in two key dimensions: (1) A novel Multi-Matrix Factorization Attention (MFA) mechanism that significantly reduces both KV cache size and computation while maintaining high attention expressiveness, and (2) Attention-FFN Disaggregation (AFD), a distributed inference system that decouples attention and Feed-Forward Network (FFN) layers into specialized subsystems. This co-design achieves unprecedented cost efficiency: Step3 significantly reduces theoretical decoding costs compared with models like DeepSeek-V3 and Qwen3 MoE 235B, with the gains widening at longer context.

Step3 achieves low cost while activating 38B parameters per token (more than DeepSeek-V3 and Qwen3 MoE 235B), demonstrating that hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD are critical to cost-effectiveness. We perform a head-to-head comparison with DeepSeek-V3 in its favorable scenarios. Our implementation on Hopper GPUs achieves a decoding throughput of up to 4,039 tokens per second per GPU under a 50 ms TPOT SLA (4K context, FP8, no MTP). This is higher than DeepSeek-V3's 2,324 tokens per second per GPU in the same setup and sets a new Pareto frontier for LLM decoding.

The Pareto frontier of recent models regarding activated parameters and decoding costs. The darker area is GQA models’ Pareto frontier.

As introduced with MFA above, we now present the idea of Attention-FFN Disaggregation (AFD). To enable AFD efficiently, our system leverages two key techniques:

  1. Multi-Stage Pipeline: Since attention and FFN layers are physically separated, it is critical to run them in parallel in order to maximize overall throughput. As shown in the figure, the attention instance receives and processes three input samples sequentially. For each processed sample, the attention instance sends the intermediate compute results to the FFN instance and waits for the corresponding outputs. The entire process runs in a streaming manner, so GPU utilization on both sides can be fully saturated, wasting no GPU cycles in the steady state. A toy sketch of this overlap appears after this list.
Multi-stage pipeline of the Step3 AFD system

  2. Efficient GPU-Direct Communication: The stringent SLA requirements of Step3 inference demand not only high throughput but also low latency. We developed StepMesh, a specialized communication library for AFD based on GPUDirect RDMA, offering ultra-low latency, zero SM usage, and flexible M-to-N communication. Its interfaces are also designed to accommodate heterogeneous accelerators beyond GPUs, which we will continue working on. StepMesh is open-sourced at https://github.com/stepfun-ai/StepMesh.
StepMesh: GPU-Direct RDMA Library for Attention-FFN Disaggregation
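
To make the overlap in the multi-stage pipeline tangible, here is a toy sketch in which sleeps stand in for GPU kernels and in-process queues stand in for StepMesh transfers; it illustrates the scheduling idea only and is not our serving code.

```python
import queue
import threading
import time

# Toy illustration of the attention/FFN pipeline: the attention instance streams
# micro-batches to the FFN instance and starts on the next one while waiting, so
# both sides stay busy in steady state.
NUM_MICROBATCHES = 8
to_ffn: queue.Queue = queue.Queue()
to_attn: queue.Queue = queue.Queue()

def attention_instance():
    for i in range(NUM_MICROBATCHES):
        time.sleep(0.01)   # attention compute for micro-batch i
        to_ffn.put(i)      # ship activations; do not block on the result

def ffn_instance():
    for _ in range(NUM_MICROBATCHES):
        i = to_ffn.get()
        time.sleep(0.01)   # expert FFN compute for micro-batch i
        to_attn.put(i)     # send outputs back for the next layer/step

start = time.time()
workers = [threading.Thread(target=attention_instance),
           threading.Thread(target=ffn_instance)]
for w in workers:
    w.start()
done = [to_attn.get() for _ in range(NUM_MICROBATCHES)]
for w in workers:
    w.join()
# Overlap gives ~0.09 s rather than the ~0.16 s a strictly serial schedule would take.
print(f"finished {len(done)} micro-batches in ≈ {time.time() - start:.2f}s")
```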

Performance Results: Our results are promising, showing Step3's significant gain in tokens/GPU/second compared to DeepSeek-V3. At a context length of 4096, we achieve a peak TGS (tokens/GPU/second) of about 4,000, around 70% higher than DSv3's reported number, with roughly one-third of the Hopper GPUs. Note that this TGS is achieved under a strict 50 ms decode time per token.

| Model | Context Len (avg) | # Hopper GPUs | Peak TGS (tokens/GPU/s) |
| --- | --- | --- | --- |
| DSv3-blog [6] | 4989 | 144 | 1850 |
| DSv3-profile [7] | 4096 | 128 | 2324 |
| Step3 (bf16 attn) | 4096 | 40 (3A2F) | 3321 |
| Step3 (fp8 attn) | 4096 | 32 (2A2F) | 4039 |
| Step3 (fp8 attn) | 8192 | 48 (4A2F) | 2643 |

Performance comparison with the reported numbers of DSv3 under a 20 tokens/s decoding SLA. TGS: tokens/GPU/s.

Known issues

  1. In the process of scaling Step3’s MoE architecture, we observed a new failure mode we have termed the “dead expert” phenomenon. In this case, certain dynamic experts become effectively inactive, not due to routing imbalance, but because their output weight norms vanish during training. This leads to negligible contribution during the model’s forward pass, even though tokens are still being routed to them, which is distinct from the more commonly discussed “router collapse”. The root causes behind this phenomenon are still being actively investigated; a simple norm-monitoring sketch is shown after this list. We’ll share further insights as our research progresses.

  2. The current version of Step3, while powerful in many respects, is under-optimised for vibe coding. Additionally, prolonged multimodal reasoning training reveals a clear trade-off: as the model’s textual reasoning ability improves, its visual-perception accuracy deteriorates. We are actively working to mitigate these limitations.
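
For the first issue, one simple way to surface such experts during training is to track per-expert output-projection weight norms. The sketch below is illustrative only; the classes and attribute names are hypothetical and do not reflect Step3's actual implementation.

```python
import torch
import torch.nn as nn

# Illustrative "dead expert" probe: flag dynamic experts whose output-projection
# weight norm has collapsed toward zero even though tokens are still routed to them.
class ToyExpert(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.up_proj = nn.Linear(dim, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, dim, bias=False)  # output projection

def find_dead_experts(experts: nn.ModuleList, rel_threshold: float = 1e-3) -> list[int]:
    norms = torch.stack([e.down_proj.weight.norm() for e in experts])
    return [i for i, n in enumerate(norms) if n < rel_threshold * norms.median()]

experts = nn.ModuleList(ToyExpert() for _ in range(8))
with torch.no_grad():
    experts[3].down_proj.weight.zero_()  # simulate a vanished output norm

print("dead experts:", find_dead_experts(experts))  # -> [3]
```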

References

[1] https://github.com/stepfun-ai/Step3/blob/main/Step3-Sys-Tech-Report.pdf
[2] Jingcheng Hu, Houyi Li, Yinmin Zhang, Zili Wang, Shuigeng Zhou, Xiangyu Zhang, and Heung-Yeung Shum. 2025. Multi-matrix Factorization Attention. In Findings of the Association for Computational Linguistics: ACL 2025, pages 25114–25126, Vienna, Austria. Association for Computational Linguistics.
[3] Houyi Li, Wenzhen Zheng, Qiufeng Wang, Zhenyu Ding, Haoying Wang, Zili Wang, et al. 2025. Predictable Scale: Part II, Farseer: A Refined Scaling Law in Large Language Models. arXiv preprint arXiv:2506.10972.
[4] Houyi Li, Wenzhen Zheng, Qiufeng Wang, Hanshan Zhang, Zili Wang, et al. 2025. Predictable Scale: Part I, Step Law — Optimal Hyperparameter Scaling Law in Large Language Model Pretraining. arXiv preprint arXiv:2503.04715.
[5] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. 2023. EVA-CLIP: Improved Training Techniques for CLIP at Scale. arXiv preprint arXiv:2303.15389.
[6] https://github.com/deepseek-ai/open-infra-index/tree/main/202502OpenSourceWeek
[7] https://github.com/deepseek-ai/profile-data/