Meet NextStep-1: A New Leap in Autoregressive Image Generation
Check out our latest, most versatile, and most powerful autoregressive image generation model, which rivals state-of-the-art diffusion-based systems.
Open Source: GitHub | Hugging Face

The world of AI-powered image generation is moving at lightning speed. Today, we’re thrilled to introduce NextStep-1, a groundbreaking model from StepFun that pushes the boundaries of what’s possible in creating and editing images from text. NextStep-1 stands out by forging a new path for autoregressive (AR) models, achieving state-of-the-art results that rival even the most powerful diffusion-based systems.

[Figure: text-to-image generation examples]

NextStep-1 delivers high-fidelity text-to-image generation while also offering powerful image editing capabilities. It supports a wide range of editing operations — such as object addition/removal, background modification, action changes, and style transfer — and can understand everyday natural language instructions, enabling flexible and free-form image editing.

[Figure: instruction-based image editing examples]

A Fresh Approach to Image Generation

For a long time, autoregressive models have achieved remarkable success in language tasks[1-3] but have struggled with image generation. Previous models [4-10] had to either bolt on heavy external diffusion modules or convert images into discrete (and often lossy) tokens via vector quantization (VQ) [11-13].

NextStep-1 charts a new course. This 14B-parameter, purely autoregressive model achieves state-of-the-art image generation quality with an extremely lightweight flow-matching head, and it works directly with continuous image tokens, preserving the full richness of visual data instead of compressing it into a limited set of discrete visual words.

[Figure: NextStep-1 model architecture]

Under the hood, NextStep-1 employs a specially tuned autoencoder to tokenize images into continuous, patchwise latent tokens, which are arranged in a single sequence alongside text tokens. A causal Transformer backbone processes this sequence uniformly, while a 157M-parameter flow-matching [14] head directly predicts the next continuous image token at visual positions. We find this unified next-token paradigm to be straightforward, scalable, and sufficient for delivering high-fidelity, highly detailed images.
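
To make this paradigm concrete, here is a minimal PyTorch sketch of the idea, not the released NextStep-1 code: text embeddings and continuous image latents share one causal sequence, and a small head is trained with a flow-matching (rectified-flow-style) objective to predict the next image latent from the backbone's hidden state. The tiny stand-in Transformer, the 16-dimensional latents, and all module names are illustrative assumptions; the real model pairs a 14B backbone with a 157M-parameter head.

```python
# Minimal PyTorch sketch of the unified next-token setup described above.
# The toy Transformer, dimensions, and module names are illustrative
# assumptions, not the released NextStep-1 implementation.
import torch
import torch.nn as nn

class FlowMatchingHead(nn.Module):
    """Small MLP that predicts a velocity field for one continuous image
    token, conditioned on the backbone's hidden state."""
    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + latent_dim + 1, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, h, x_t, t):
        # h: (B, hidden) context, x_t: (B, latent) noisy latent, t: (B, 1) time.
        return self.net(torch.cat([h, x_t, t], dim=-1))

hidden_dim, latent_dim = 1024, 16
backbone = nn.TransformerEncoder(   # stand-in for the causal LLM backbone
    nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True),
    num_layers=2,
)
head = FlowMatchingHead(hidden_dim, latent_dim)

# One mixed sequence: embedded text tokens followed by continuous image latents
# (both already projected to the backbone width).
B, T_text, T_img = 2, 8, 4
seq = torch.randn(B, T_text + T_img, hidden_dim)
causal_mask = nn.Transformer.generate_square_subsequent_mask(T_text + T_img)
h = backbone(seq, mask=causal_mask)             # causal contextual states

# Flow-matching loss for the first image token, predicted from the hidden
# state at the last text position (standard next-token conditioning).
x1 = torch.randn(B, latent_dim)                 # ground-truth continuous latent
x0 = torch.randn(B, latent_dim)                 # Gaussian noise sample
t = torch.rand(B, 1)
x_t = (1 - t) * x0 + t * x1                     # linear interpolation path
v_pred = head(h[:, T_text - 1], x_t, t)
loss = ((v_pred - (x1 - x0)) ** 2).mean()       # regress the constant velocity
```

In this setup the backbone never predicts pixels or discrete codes directly; it only has to produce a context vector rich enough for the head to turn noise into the next latent.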


Benchmark Performance

NextStep-1 demonstrates outstanding performance across challenging benchmarks, covering a broad spectrum of capabilities.

  • Prompt Following: On GenEval[15], NextStep-1 achieves a competitive score of 0.63 (w/o self-CoT) and 0.73 (w/ self-CoT). On GenAI-Bench[16], a benchmark that tests compositional abilities, NextStep-1 scores 0.67 on advanced prompts and 0.88 on basic ones, demonstrating a strong ability to understand and render complex scenes. On DPG-Bench[17], which uses long, detailed prompts, NextStep-1 achieves a score of 85.28, confirming its reliability in handling complex user requests.
| Method | GenEval↑ | GenAI-Bench↑ (Basic) | GenAI-Bench↑ (Advanced) | DPG-Bench↑ |
|---|---|---|---|---|
| **Proprietary** | | | | |
| DALL·E 3 (Betker et al., 2023) | 0.67 | 0.90 | 0.70 | 83.50 |
| Seedream 3.0 (Gao et al., 2025) | 0.84 | – | – | 88.27 |
| GPT-4o (OpenAI, 2025b) | 0.84 | – | – | 85.15 |
| **Diffusion** | | | | |
| Stable Diffusion 1.5 (Rombach et al., 2022) | 0.43 | – | – | – |
| Stable Diffusion XL (Podell et al., 2024) | 0.55 | 0.83 | 0.63 | 74.65 |
| Stable Diffusion 3 Medium (Esser et al., 2024) | 0.74 | 0.88 | 0.65 | 84.08 |
| Stable Diffusion 3.5 Large (Esser et al., 2024) | 0.71 | 0.88 | 0.66 | 83.38 |
| PixArt-Alpha (Chen et al., 2024) | 0.48 | – | – | 71.11 |
| Flux-1-dev (Labs, 2024) | 0.66 | 0.86 | 0.65 | 83.79 |
| Transfusion (Zhou et al., 2025) | 0.63 | – | – | – |
| CogView4 (Z.ai, 2025) | 0.73 | – | – | 85.13 |
| Lumina-Image 2.0 (Qin et al., 2025) | 0.73 | – | – | 87.20 |
| HiDream-I1-Full (Cai et al., 2025) | 0.83 | 0.91 | 0.66 | 85.89 |
| Mogao (Liao et al., 2025) | 0.89 | – | 0.68 | 84.33 |
| BAGEL (Deng et al., 2025) | 0.82 / 0.88† | 0.89 / 0.86† | 0.69 / 0.75† | 85.07 |
| Show-o2-7B (Xie et al., 2025b) | 0.76 | – | – | 86.14 |
| OmniGen2 (Wu et al., 2025b) | 0.80 / 0.86* | – | – | 83.57 |
| Qwen-Image (Wu et al., 2025a) | 0.87 | – | – | 88.32 |
| **AutoRegressive** | | | | |
| SEED-X (Ge et al., 2024) | 0.49 | 0.86 | 0.70 | – |
| Show-o (Xie et al., 2024) | 0.53 | 0.70 | 0.60 | – |
| VILA-U (Wu et al., 2024) | – | 0.76 | 0.64 | – |
| Emu3 (Wang et al., 2024b) | 0.54 / 0.65* | 0.78 | 0.60 | 80.60 |
| SimpleAR (Wang et al., 2025c) | 0.63 | – | – | 81.97 |
| Fluid (Fan et al., 2024) | 0.69 | – | – | – |
| Infinity (Han et al., 2025) | 0.79 | – | – | 86.60 |
| Janus-Pro-7B (Chen et al., 2025b) | 0.80 | 0.86 | 0.66 | 84.19 |
| Token-Shuffle (Ma et al., 2025b) | 0.62 | 0.78 | 0.67 | – |
| **NextStep-1** | 0.63 / 0.73† | 0.88 / 0.90† | 0.67 / 0.74† | 85.28 |

* results are with prompt rewrite, † results are with self-CoT.


  • World Knowledge: In the WISE[18] benchmark, which evaluates a model’s ability to integrate real-world knowledge into images, NextStep-1 achieves an overall score of 0.54, outperforming most diffusion models and all other autoregressive models.
| Model | Cultural | Time | Space | Biology | Physics | Chemistry | Overall↑ | Overall (Rewrite)↑ |
|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | |
| GPT-4o (OpenAI, 2025b) | 0.81 | 0.71 | 0.89 | 0.83 | 0.79 | 0.74 | 0.80 | – |
| **Diffusion** | | | | | | | | |
| Stable Diffusion 1.5 (Rombach et al., 2022) | 0.34 | 0.35 | 0.32 | 0.28 | 0.29 | 0.21 | 0.32 | 0.50 |
| Stable Diffusion XL (Podell et al., 2024) | 0.43 | 0.48 | 0.47 | 0.44 | 0.45 | 0.27 | 0.43 | 0.65 |
| Stable Diffusion 3.5 Large (Stability-AI, 2024) | 0.44 | 0.50 | 0.58 | 0.44 | 0.52 | 0.31 | 0.46 | 0.72 |
| PixArt-Alpha (Chen et al., 2024) | 0.45 | 0.50 | 0.48 | 0.49 | 0.56 | 0.34 | 0.47 | 0.63 |
| Playground v2.5 (Li et al., 2024b) | 0.49 | 0.58 | 0.55 | 0.43 | 0.48 | 0.33 | 0.49 | 0.71 |
| Flux.1-dev (Labs, 2024) | 0.48 | 0.58 | 0.62 | 0.42 | 0.51 | 0.35 | 0.50 | 0.73 |
| MetaQuery-XL (Pan et al., 2025) | 0.56 | 0.55 | 0.62 | 0.49 | 0.63 | 0.41 | 0.55 | – |
| BAGEL (Deng et al., 2025) | 0.44 / 0.76† | 0.55 / 0.69† | 0.68 / 0.75† | 0.44 / 0.65† | 0.60 / 0.75† | 0.39 / 0.58† | 0.52 / 0.70† | 0.71 / 0.77† |
| Qwen-Image (Wu et al., 2025a) | 0.62 | 0.63 | 0.77 | 0.57 | 0.75 | 0.40 | 0.62 | – |
| **AutoRegressive** | | | | | | | | |
| Show-o-512 (Xie et al., 2024) | 0.28 | 0.40 | 0.48 | 0.30 | 0.46 | 0.30 | – | 0.64 |
| VILA-U (Wu et al., 2024) | 0.26 | 0.33 | 0.37 | 0.35 | 0.39 | 0.23 | – | – |
| Emu3 (Wang et al., 2024b) | 0.34 | 0.45 | 0.48 | 0.41 | 0.45 | 0.27 | – | 0.63 |
| Janus-Pro-7B (Chen et al., 2025b) | 0.30 | 0.37 | 0.49 | 0.36 | 0.42 | 0.26 | – | 0.71 |
| **NextStep-1** | 0.51 / 0.70† | 0.54 / 0.65† | 0.61 / 0.69† | 0.52 / 0.63† | 0.63 / 0.73† | 0.48 / 0.52† | 0.54 / 0.67† | 0.79 / 0.83† |

* results are with prompt rewrite, † results are with self-CoT.


  • Image Editing: Our instruction-based editing model, NextStep-1-Edit, also shows competitive performance, scoring 6.58 on GEdit-Bench[19] and 3.71 on ImgEdit-Bench[20].
| Model | GEdit-EN G_SC | GEdit-EN G_PQ | GEdit-EN G_O | GEdit-CN G_SC | GEdit-CN G_PQ | GEdit-CN G_O | ImgEdit-Bench↑ |
|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | |
| Gemini 2.0 (Gemini2, 2025) | 6.87 | 7.44 | 6.51 | 5.26 | 7.60 | 5.14 | – |
| Doubao (Shi et al., 2024) | 7.22 | 7.89 | 6.98 | 7.17 | 7.79 | 6.84 | – |
| GPT-4o (OpenAI, 2025b) | 7.74 | 8.13 | 7.49 | 7.52 | 8.02 | 7.30 | 4.20 |
| Flux.1-Kontext-pro (Labs et al., 2025) | 7.02 | 7.60 | 6.56 | 1.11 | 7.36 | 1.23 | – |
| **Open-source** | | | | | | | |
| Instruct-Pix2Pix (Brooks et al., 2023) | 3.30 | 6.19 | 3.22 | – | – | – | 1.88 |
| MagicBrush (Zhang et al., 2023a) | 4.52 | 6.37 | 4.19 | – | – | – | 1.83 |
| AnyEdit (Yu et al., 2024a) | 3.05 | 5.88 | 2.85 | – | – | – | 2.45 |
| OmniGen (Xiao et al., 2024) | 5.88 | 5.87 | 5.01 | – | – | – | 2.96 |
| OmniGen2 (Wu et al., 2025b) | 7.16 | 6.77 | 6.41 | – | – | – | 3.44 |
| Step1X-Edit v1.0 (Liu et al., 2025) | 7.13 | 7.00 | 6.44 | 7.30 | 7.14 | 6.66 | 3.06 |
| Step1X-Edit v1.1 (Liu et al., 2025) | 7.66 | 7.35 | 6.97 | 7.65 | 7.40 | 6.98 | – |
| BAGEL (Deng et al., 2025) | 7.36 | 6.83 | 6.52 | 7.34 | 6.85 | 6.50 | 3.42 |
| Flux.1-Kontext-dev (Labs et al., 2025) | – | – | 6.26 | – | – | – | 3.71 |
| GPT-Image-Edit (Wang et al., 2025d) | – | – | 7.24 | – | – | – | 3.80 |
| **NextStep-1** | 7.15 | 7.01 | 6.58 | 6.88 | 7.02 | 6.40 | 3.71 |

GEdit-Bench-EN/CN columns report full-set scores (G_SC / G_PQ / G_O); higher is better for all metrics.


Key Insights and Discoveries

Building a purely autoregressive model for images is no easy feat. That’s why we’re not just releasing a strong foundation model for the community — we’re also sharing the key insights we gained along the way. Our hope is that these lessons will shed light on how autoregressive modeling really works when it comes to image generation.

Causal Transformer CAN be the Real Artist

For a long time, researchers have questioned whether a causal transformer could truly handle autoregressive image generation on its own — without relying on vector quantization or offloading much of the generation process to heavyweight external diffusers. With NextStep-1, we demonstrate that, with the right image tokenization and training strategies, an LLM-style transformer can be the primary creative engine.

To investigate this, we test flow-matching heads of significantly different scales—40M, 157M, and 528M parameters—and find that image quality remains largely unaffected by the head size. This fascinating discovery strongly suggests that the transformer backbone is doing the heavy lifting, driving the core generative modeling and high-level reasoning. In NextStep-1, the flow-matching head functions more like a lightweight sampler, converting the transformer’s rich contextual predictions into the final image patches.

[Figure: generation quality across flow-matching head sizes]
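
To show what we mean by "lightweight sampler", the sketch below runs a few Euler steps of the learned velocity field to turn Gaussian noise into the next continuous token, conditioned on the backbone's hidden state. It assumes the toy FlowMatchingHead interface from the sketch above; the step count and latent size are arbitrary assumptions rather than NextStep-1's actual inference settings.

```python
# Hedged sketch of next-token sampling with a small flow-matching head,
# reusing the toy FlowMatchingHead interface from the earlier sketch.
# The number of Euler steps and the latent size are arbitrary assumptions.
import torch

@torch.no_grad()
def sample_next_image_token(head, h, latent_dim: int = 16, num_steps: int = 20):
    """h: (B, hidden_dim) contextual state from the causal backbone."""
    B = h.shape[0]
    x = torch.randn(B, latent_dim)       # start from pure Gaussian noise (t = 0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((B, 1), i * dt)   # current time along the flow
        v = head(h, x, t)                # velocity predicted from context + state
        x = x + dt * v                   # Euler step toward the data end (t = 1)
    return x                             # continuous latent for the next image patch
```

Because the semantic decisions are already encoded in the contextual state, swapping in a bigger head changes very little, which matches the 40M/157M/528M observation above.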

Tokenizer is the Key to Stability and Quality

When working with continuous image tokens, the tokenizer is the beating heart of stability and visual fidelity in an autoregressive pipeline. Among the “secret recipes” that make NextStep-1 work so well, two key insights stand out:

  • Channel-Wise Normalization Brings Stability: Push the classifier-free guidance (CFG) scale high enough, and many models start showing strange artifacts — warped textures, ghost shapes, and inconsistent colors. Our findings suggest that the culprit lies in a statistical drift in the generated tokens. The fix is simple yet effective: apply channel-wise normalization inside the tokenizer. This keeps token statistics stable under high CFG, allowing NextStep-1 to produce sharp, artifact-free images even when the guidance dial is turned all the way up.
[Figure: latent token mean and variance statistics]
  • More Noise, More Quality: Counterintuitively, we find that adding more noise during tokenizer training — even though it increases reconstruction error — ultimately improves the quality of images produced by the autoregressive model. We suggest that this makes the latent space far more robust and evenly distributed, which, in turn, gives the autoregressive model a cleaner, more learnable starting point. Both tricks are sketched in code below.
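
The snippet below sketches one plausible form of these two tokenizer-side tricks: per-channel normalization of the latents and Gaussian noise augmentation during tokenizer training. The statistics, the noise level, and the function names are assumptions for illustration, not the exact recipe used in NextStep-1.

```python
# Hedged sketch of the two tokenizer-side tricks; the exact statistics and
# noise schedule used in NextStep-1 may differ from what is shown here.
import torch

def channelwise_normalize(z: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each latent channel to zero mean / unit variance so token
    statistics stay stable even under strong classifier-free guidance."""
    mean = z.mean(dim=(0, 2, 3), keepdim=True)  # per-channel statistics over the batch
    std = z.std(dim=(0, 2, 3), keepdim=True)
    return (z - mean) / (std + eps)

def noise_augment(z: torch.Tensor, sigma: float = 0.5) -> torch.Tensor:
    """Perturb latents during tokenizer training: reconstruction gets slightly
    worse, but the latent space becomes smoother and easier for the AR model."""
    return z + sigma * torch.randn_like(z)

# Example: a batch of patchwise latents (B, C, H, W) from the image autoencoder.
z = torch.randn(4, 16, 32, 32)
z_train = noise_augment(channelwise_normalize(z))
print(z_train.shape)  # torch.Size([4, 16, 32, 32])
```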

Picturing the NextStep Towards Multimodal Generation

We believe NextStep-1 is more than just a powerful image generation model. It unlocks the potential of pure causal transformers to generate inherently continuous tokens, and it charts a promising path for the next step of multimodal generation. By releasing both the model and our technical report to the community, we aim to inspire further research, foster collaboration, and accelerate progress on this exciting frontier.


References

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems (NeurIPS), 2020.
[2] OpenAI. Introducing gpt-4.1 in the api. OpenAI Blog, 2025a. URL https://openai.com/index/gpt-4-1.
[3] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. OpenAI Technical Report, 2018.
[4] X. Chen, C. Wu, Z. Wu, Y. Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruan, and P. Luo. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025.
[5] R. Dong, C. Han, Y. Peng, Z. Qi, Z. Ge, J. Yang, L. Zhao, J. Sun, H. Zhou, H. Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. In International Conference on Learning Representations (ICLR), 2024.
[6] Q. Sun, Q. Yu, Y. Cui, F. Zhang, X. Zhang, Y. Wang, H. Gao, J. Liu, T. Huang, and X. Wang. Emu: Generative pretraining in multimodality. In International Conference on Learning Representations (ICLR), 2023.
[7] Q. Sun, Y. Cui, X. Zhang, F. Zhang, Q. Yu, Y. Wang, Y. Rao, J. Liu, T. Huang, and X. Wang. Generative multimodal models are in-context learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[8] Y. Sun, H. Bao, W. Wang, Z. Peng, L. Dong, S. Huang, J. Wang, and F. Wei. Multimodal latent language modeling with next-token diffusion. arXiv preprint arXiv:2412.08635, 2024.
[9] X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang, F. Zhang, Y. Wang, Z. Li, Q. Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024.
[10] J. Yu, Y. Xu, J. Y. Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, B. K. Ayan, B. Hutchinson, W. Han, Z. Parekh, X. Li, H. Zhang, J. Baldridge, and Y. Wu. Scaling autoregressive models for content-rich text-to-image generation. In TMLR, 2022.
[11] P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[12] L. Yu, J. Lezama, N. B. Gundavarapu, L. Versari, K. Sohn, D. Minnen, Y. Cheng, V. Birodkar, A. Gupta, X. Gu, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023.
[13] C. Zheng, T.-L. Vuong, J. Cai, and D. Phung. Movq: Modulating quantized vectors for high-fidelity image generation. Advances in Neural Information Processing Systems (NeurIPS), 2022.
[14] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2023b.
[15] D. Ghosh, H. Hajishirzi, and L. Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. In Advances in neural information processing systems (NeurIPS), 2023.
[16] B. Li, Z. Lin, D. Pathak, J. Li, Y. Fei, K. Wu, X. Xia, P. Zhang, G. Neubig, and D. Ramanan. Evaluating and improving compositional text-to-visual generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[17] X. Hu, R. Wang, Y. Fang, B. Fu, P. Cheng, and G. Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024.
[18] Y. Niu, M. Ning, M. Zheng, B. Lin, P. Jin, J. Liao, K. Ning, B. Zhu, and L. Yuan. Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265, 2025.
[19] S. Liu, Y. Han, P. Xing, F. Yin, R. Wang, W. Cheng, J. Liao, Y. Wang, H. Fu, C. Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025.
[20] Y. Ye, X. He, Z. Li, B. Lin, S. Yuan, Z. Yan, B. Hou, and L. Yuan. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275, 2025.