Step-Audio 2: Breakthrough in End-to-End Speech Modeling Technology
Step-Audio 2 is an end-to-end multimodal large language model engineered for industrial applications. It combines a latent-space audio encoder with audio reinforcement learning, captures paralinguistic information and vocal style features, and is optimized with a strategy that pairs chain-of-thought (CoT) reasoning with reinforcement learning. Step-Audio 2 delivers high-performance dialogue capabilities across diverse scenarios, and experimental results show state-of-the-art (SOTA) performance on multiple audio comprehension and dialogue benchmarks.

I. Challenges Addressed

Large audio-language models (LALMs) face significant challenges:

  • Ineffective paralinguistic modeling: Existing solutions struggle to capture intonation, emotion, and vocal states while overemphasizing semantic content
  • Severe end-to-end hallucination: Current architectures lack access to real-world textual and acoustic knowledge bases

Step-Audio 2 resolves these through three innovations:

  1. Genuine end-to-end architecture: Direct raw audio processing enabling effective paralinguistic comprehension
  2. CoT-reinforcement learning fusion: First model with audio reasoning capabilities for precise understanding of non-textual signals
  3. Acoustic knowledge enhancement: Leverages web search and audio retrieval to eliminate hallucinations while enabling dynamic voice switching
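
As a rough illustration of the acoustic knowledge enhancement in item 3, the sketch below routes a model-issued tool call to either a web search or an audio retrieval backend. Everything here is an assumption made for illustration: the tool names `web_search` and `audio_search`, the dictionary shape of a tool call, and the stub return values are hypothetical and are not taken from Step-Audio 2's actual interface.

```python
from typing import Any, Callable, Dict


def web_search(query: str) -> str:
    """Hypothetical stand-in for a text search backend."""
    return f"[web results for: {query}]"


def audio_search(query: str) -> str:
    """Hypothetical stand-in for an audio retrieval backend."""
    return f"[audio clip matching: {query}]"


# Registry mapping tool names (assumed, not from the source) to callables.
TOOLS: Dict[str, Callable[[str], str]] = {
    "web_search": web_search,
    "audio_search": audio_search,
}


def dispatch(tool_call: Dict[str, Any]) -> str:
    """Route one tool call of the assumed form {"name": ..., "query": ...}."""
    name = tool_call.get("name")
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name!r}")
    return TOOLS[name](tool_call["query"])


if __name__ == "__main__":
    print(dispatch({"name": "web_search", "query": "weather in Shanghai"}))
    print(dispatch({"name": "audio_search", "query": "calm female voice"}))
```

In a setup like this, the retrieved text or audio would be fed back into the model's context before it answers, which is how grounding is meant to limit hallucination and how a retrieved voice sample could drive voice switching.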

Demos

Hello, I’m Xiao Yue, your intelligent assistant companion.

Companion Mode: Paralinguistic Comprehension

Voice Analysis: Recognition Capabilities

Cognitive Expertise: Learning & Computation

Multilingual Mastery: Dialect Comprehension

Dynamic Voice: Tone Switching

Narrative Generation: Creative Expression

Emotional Intelligence: Dialogue Reasoning

II. Technical Advantages: Innovations and Breakthroughs

Compared to existing solutions, Step-Audio 2 introduces:

2.1 Architecture Innovation

Fig. 1: Step-Audio 2 end-to-end architecture

  • Genuine end-to-end processing ⇨ Eliminates traditional ASR+LLM+TTS pipelines ⇨ Reduces latency and simplifies architecture
  • Continuous input/discrete output paradigm ⇨ Processes raw waveforms to prevent feature loss ⇨ Ensures synthesis stability via discrete acoustic tokens
  • Interleaved modality alignment ⇨ Interleaves text and speech tokens at a fixed ratio ⇨ Keeps the two modalities tightly aligned while preserving the model's reasoning capacity
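
The fixed-ratio interleaving described in the last bullet can be sketched as follows. This is a minimal illustration, not the model's actual implementation: the function name `interleave_fixed_ratio`, the `(modality, id)` token representation, and the 1:4 text-to-audio ratio are all assumptions, since the source does not state the ratio used.

```python
from itertools import islice
from typing import Iterable, List, Tuple

Token = Tuple[str, int]  # (modality, token id); representation is assumed


def interleave_fixed_ratio(
    text_ids: Iterable[int],
    audio_ids: Iterable[int],
    text_chunk: int = 1,   # assumed ratio, not given in the source
    audio_chunk: int = 4,  # assumed ratio, not given in the source
) -> List[Token]:
    """Alternate chunks of text and audio tokens at a fixed ratio.

    When one stream runs out, the remainder of the other is still
    emitted in chunks, so no token is dropped.
    """
    if text_chunk <= 0 or audio_chunk <= 0:
        raise ValueError("chunk sizes must be positive")

    text_iter, audio_iter = iter(text_ids), iter(audio_ids)
    sequence: List[Token] = []
    while True:
        text_part = list(islice(text_iter, text_chunk))
        audio_part = list(islice(audio_iter, audio_chunk))
        if not text_part and not audio_part:
            break
        sequence.extend(("text", t) for t in text_part)
        sequence.extend(("audio", a) for a in audio_part)
    return sequence


if __name__ == "__main__":
    mixed = interleave_fixed_ratio([1, 2], range(10, 19))
    print(mixed)
```

Interleaving at a fixed ratio keeps each text token close to the audio tokens that realize it, which is one plausible reading of the "tight modality alignment" claimed above.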

2.2 Data Engineering

  • Multi-million hour acoustic corpus ⇨ Training spans languages, scenarios, and device environments
  • High-emotion dialogue synthesis pipeline

2.3 Highlights

  • Acoustic reasoning capability ⇨ Industry-leading granular comprehension:
    • SOTA 78.86 on Chinese dialogue benchmark (URO-Bench)
    • 76.55% accuracy on 11 paralinguistic features (Step-SPQA)
    • 77.4% on MMAU benchmark, surpassing GPT-4o-Audio and Gemini-2.5-Pro
  • Real-time voice modulation ⇨ Voice-command triggered tone switching

III. Experience Step-Audio 2

You can interact live with Step-Audio 2 in the latest StepFun App, which will be released soon. Open the app and select the microphone icon to start a live conversation:

Stepwise AI voice dialogue

IV. Benchmark Performance

Benchmark summary

1. Public Leaderboards

  • SOTA on multiple ASR benchmarks:

    • English: #1 globally on Common Voice/LibriSpeech, #1 domestically on Fleurs-EN
    • Chinese: #1 globally on AISHELL-2/KeSpeech (heavy accent)/WenetSpeech, #1 domestically on Fleurs-zh
      SOTA performance across ASR benchmarks
  • MMAU: #1 globally, outperforming GPT-4o-Audio, Gemini-2.5-Pro, NVIDIA’s Audio Flamingo 3, and specialist compact models such as Omni-R1

    Global #1 on MMAU benchmark

  • URO-Bench: #1 globally for Chinese, #1 domestically for English

    URO-Bench: Global leader in Chinese, domestic leader in English

2. Self-built Benchmarks

  • Step-SPQA: Industry’s first paralinguistic comprehension benchmark, ranked #1 globally

    Global #1 on StepEval-Audio-Paralinguistic benchmark

  • Step-AudioToolcall: World’s first voice-enabled toolcall benchmark

    StepEval-Audio-Toolcall benchmark
