MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models

Abstract

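Discrete audio tokenizers are fundamental to giving large language models native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures. These designs introduce fixed inductive biases that limit reconstruction fidelity and hinder effective scaling. In this paper, we argue that discrete audio tokenization should be learned fully end-to-end using a homogeneous, scalable architecture. To this end, we first propose CAT (Causal Audio Tokenizer with Transformer), a purely Transformer-based architecture that jointly optimizes the encoder, quantizer, and decoder from scratch. Building on the CAT architecture, we develop MOSS-Audio-Tokenizer, a large-scale audio tokenizer with 1.6 billion parameters, pretrained on 3 million hours of diverse, general audio data. We show that this simple, fully end-to-end approach built from homogeneous, causal Transformer blocks scales gracefully and supports high-fidelity reconstruction across diverse audio domains. Across speech, sound, and music, MOSS-Audio-Tokenizer consistently outperforms prior codecs over a range of bitrates and exhibits predictable improvements with increased scale. Notably, using the discrete tokens from our model, we develop the first purely autoregressive TTS model that surpasses prior non-autoregressive and cascaded systems. Furthermore, the tokenizer enables competitive automatic speech recognition (ASR) performance without auxiliary encoders. Our findings position the CAT architecture as a unified, scalable interface for the next generation of native audio foundation models.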

One-sentence Summary

MOSI.AI researchers propose CAT, a fully end-to-end, Transformer-based audio tokenizer, enabling high-fidelity reconstruction and scalable audio foundation models; their 1.6B-parameter MOSS-Audio-Tokenizer outperforms prior codecs across speech, music, and sound, and powers state-of-the-art autoregressive TTS and ASR without auxiliary encoders.

Key Contributions

  • We introduce CAT, a fully end-to-end, homogeneous Transformer-based architecture for discrete audio tokenization that jointly optimizes encoder, quantizer, and decoder from scratch, eliminating reliance on pretrained components or heterogeneous CNN designs to improve reconstruction fidelity and scalability.
  • We scale CAT into MOSS-Audio-Tokenizer, a 1.6B-parameter model trained on 3 million hours of diverse audio, which achieves state-of-the-art reconstruction across speech, sound, and music at all bitrates and exhibits predictable performance gains with scale.
  • Leveraging MOSS-Audio-Tokenizer’s tokens, we build the first purely autoregressive TTS system that outperforms non-autoregressive and cascaded baselines, and demonstrate competitive ASR performance without auxiliary encoders, validating CAT as a unified interface for audio foundation models.

Introduction

The authors leverage a fully end-to-end, homogeneous Transformer architecture—CAT—to build MOSS-Audio-Tokenizer, a scalable 1.6B-parameter audio tokenizer trained on 3 million hours of diverse audio. Prior tokenizers often rely on pretrained encoders, hybrid CNN-Transformer designs, or multi-stage training, which introduce fixed inductive biases that limit reconstruction fidelity and hinder scaling. MOSS-Audio-Tokenizer overcomes these by jointly optimizing encoder, quantizer, and decoder from scratch under a causal, streaming-friendly framework, achieving state-of-the-art reconstruction across speech, sound, and music at all bitrates. Its discrete tokens enable the first purely autoregressive TTS system to outperform non-autoregressive and cascaded baselines, and support competitive ASR without auxiliary encoders—positioning CAT as a unified foundation for scalable, native audio language models.

Dataset

The authors use a collection of 12 baseline audio tokenizers for evaluation, each sourced from official releases and configured for monophonic audio at either 16 kHz or 24 kHz. Key details:

  • Encodec: Official causal model at 24 kHz; ~14M parameters. Bitrate controlled by truncating RVQ layers during eval.
  • DAC: 24 kHz monophonic model; ~74M parameters. Uses engineered discriminators and improved VQ.
  • SpeechTokenizer: Trained on 16 kHz speech; ~103.67M parameters. Distills HuBERT via first RVQ layer for speech disentanglement.
  • Mimi: 24 kHz; outputs tokens at 12.5 Hz. Supports streaming encoding/decoding.
  • BigCodec: 16 kHz; single VQ codebook (size 8,192); 80 Hz frame rate; ~159M parameters.
  • Stable Codec: 16 kHz speech; uses RFSQ bottleneck; ~953M parameters. Eval uses 1x46656_400bps and 2x15625_700bps presets.
  • XCodec2.0: 16 kHz; integrates pre-trained speech encoder; 50 Hz frame rate; ~822M parameters.
  • XY-Tokenizer: 16 kHz; jointly models semantic/acoustic info via dual encoders; 12.5 Hz, 8-layer RVQ (codebook 1,024); ~519M parameters. Quantizer dropout disabled.
  • Higgs Audio Tokenizer: 24 kHz; ~201M parameters.
  • MiMo Audio Tokenizer: Trained on >11M hours; supports waveform reconstruction and language modeling; ~1.2B parameters.
  • Qwen3 TTS Tokenizer: 24 kHz; 12.5 Hz frame rate; ~170M parameters. Designed for streaming TTS.

For bitrate control during training, the authors apply Progressive Sequence Dropout to randomly truncate active RVQ layers. At inference, decoding uses only the first k RVQ tokens per timestep, and the Depth Transformer autoregressively predicts only those k tokens, omitting finer layers. All models are evaluated under their default or recommended configurations without further filtering or dataset composition beyond their original training data.
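The link between RVQ depth and bitrate can be made concrete with a short calculation. The sketch below assumes every RVQ layer uses the same codebook size and that token indices are stored without entropy coding; the helper name `bitrate_bps` is illustrative and does not come from the paper. It reproduces the nominal bitrates implied by two of the baseline configurations listed above and shows how truncating to the first k layers scales the bitrate linearly.

```python
import math

def bitrate_bps(frame_rate_hz: float, n_layers: int, codebook_size: int) -> float:
    """Nominal bits per second when each frame carries `n_layers` code indices."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * n_layers * bits_per_token

# Configurations quoted in the baseline list above.
xy_tokenizer = bitrate_bps(12.5, 8, 1024)  # 12.5 Hz, 8-layer RVQ, 1,024 entries
bigcodec = bitrate_bps(80, 1, 8192)        # 80 Hz, single codebook, 8,192 entries

# Keeping only the first k RVQ layers scales the bitrate linearly in k.
truncated = [bitrate_bps(12.5, k, 1024) for k in (1, 2, 4, 8)]

print(xy_tokenizer, bigcodec, truncated)
```

Under these assumptions, an 8-layer, 12.5 Hz tokenizer with 1,024-entry codebooks sits at about 1 kbps, and dropping to a single layer reduces it to one eighth of that.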

Method

The authors leverage a purely Transformer-based architecture, CAT (Causal Audio Tokenizer with Transformer), to achieve scalable, high-fidelity discrete audio tokenization without relying on convolutional inductive biases. The framework operates directly on raw waveforms and is designed for end-to-end training, supporting both semantic alignment and acoustic reconstruction. Refer to the framework diagram for an overview of the encoder-decoder structure and auxiliary components.

The encoder processes raw 24 kHz audio by first patchifying the waveform into fixed-dimensional vectors and then applying a stack of causal Transformer blocks. To progressively compress the temporal resolution, patchify operations are inserted after specific Transformer layers, reducing the sequence length and ultimately mapping the input to a discrete token sequence at 12.5 Hz. The decoder mirrors this structure in reverse, reconstructing the waveform from discrete tokens in a fully causal manner. Discretization is handled by a 32-layer residual vector quantizer (RVQ), which supports variable-bitrate tokenization via quantizer dropout during training. Each quantization layer employs factorized vector quantization with L2-normalized codebooks, and the codebook entries are optimized directly via gradient descent.
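Residual vector quantization and prefix decoding can be illustrated with a small NumPy sketch. This is a simplified stand-in, not the authors' implementation: it uses plain Euclidean nearest-neighbour lookup, omits the factorized projection and L2 normalization described above, and uses random toy codebooks instead of learned ones. The zero entry placed in each codebook is an assumption made only so that the toy example has a guaranteed non-increasing error as layers are added.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize frames x (T, D) layer by layer; returns indices of shape (T, n_layers)."""
    residual = x.astype(np.float64).copy()
    indices = []
    for codebook in codebooks:
        # Squared Euclidean distance from every residual frame to every code.
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        indices.append(idx)
        residual -= codebook[idx]  # the next layer quantizes what is left over
    return np.stack(indices, axis=1)

def rvq_decode(indices, codebooks, k=None):
    """Sum the code vectors of the first k layers (all layers if k is None)."""
    k = indices.shape[1] if k is None else k
    out = np.zeros((indices.shape[0], codebooks[0].shape[1]))
    for layer in range(k):
        out += codebooks[layer][indices[:, layer]]
    return out

rng = np.random.default_rng(0)
n_layers, codebook_size, dim = 4, 16, 8  # toy sizes; CAT uses a 32-layer RVQ
codebooks = []
for layer in range(n_layers):
    cb = rng.normal(scale=0.5 ** layer, size=(codebook_size, dim))
    cb[0] = 0.0  # toy assumption: a zero code means a layer can never add error
    codebooks.append(cb)

frames = rng.normal(size=(25, dim))  # e.g. 2 s of latent frames at 12.5 Hz
codes = rvq_encode(frames, codebooks)
errors = [float(((frames - rvq_decode(codes, codebooks, k)) ** 2).sum())
          for k in range(1, n_layers + 1)]
print(codes.shape, errors)
```

Decoding from only the first k columns of `codes` is what makes a single trained model usable at several bitrates: the coarse layers carry most of the signal and later layers refine it.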

To encourage semantically rich representations, the authors attach a 0.5B-parameter decoder-only causal language model (LLM) that conditions on the quantizer’s hidden states. The LLM is trained on diverse audio-to-text tasks—including ASR, multi-speaker ASR, and audio captioning—using a task-specific prompt token prepended to the input. The semantic loss is computed as:

$$\mathcal{L}_{\mathrm{sem}} = -\sum_{t=1}^{|\mathbf{s}|} \log p_{\theta_{\mathrm{LLM}}}\left(\mathbf{s}_t \mid \mathcal{T}, \mathbf{q}, \mathbf{s}_{<t}\right),$$

where $\mathbf{s}$ is the target text sequence, $\mathbf{q}$ is the quantized audio representation, and $\mathcal{T}$ is the task tag.
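The semantic loss is a standard next-token negative log-likelihood over the target text. A minimal NumPy sketch follows, assuming the LLM has already produced one row of logits per target position (conditioned on the task tag, the quantized audio, and the preceding text); the function names and toy sizes are illustrative only.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def semantic_nll(logits, target_ids):
    """Sum of -log p(s_t | ...) over target positions.

    logits: (T, V) scores for each target position; target_ids: (T,) token ids.
    """
    log_probs = log_softmax(logits)
    token_log_probs = log_probs[np.arange(len(target_ids)), target_ids]
    return float(-token_log_probs.sum())

vocab_size = 5
target_ids = np.array([2, 0, 4])

# A model with no information assigns uniform probability to every token.
uniform_loss = semantic_nll(np.zeros((3, vocab_size)), target_ids)

# Raising the logit of each correct token lowers the loss.
confident = np.zeros((3, vocab_size))
confident[np.arange(3), target_ids] = 5.0
confident_loss = semantic_nll(confident, target_ids)
print(uniform_loss, confident_loss)
```

With uniform predictions the loss equals the number of target tokens times log V, which gives a useful sanity baseline when monitoring training.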

Acoustic fidelity is ensured through a multi-scale mel-spectrogram loss:

$$\mathcal{L}_{\mathrm{rec}} = \sum_{i=5}^{11} \left\| S_{2i}(\mathbf{x}) - S_{2i}(\hat{\mathbf{x}}) \right\|_1,$$

where $S_{2i}(\cdot)$ denotes the mel-spectrogram computed with window size $2^i$ and hop size $2^{i-2}$. Adversarial training with multiple discriminators further enhances perceptual quality, following the loss formulations from XY-Tokenizer. The overall generator objective combines semantic, reconstruction, commitment, codebook, adversarial, and feature matching losses with learnable weights:

$$\mathcal{L}_{\mathrm{G}} = \lambda_{\mathrm{sem}} \mathcal{L}_{\mathrm{sem}} + \lambda_{\mathrm{rec}} \mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{cmt}} \mathcal{L}_{\mathrm{cmt}} + \lambda_{\mathrm{code}} \mathcal{L}_{\mathrm{code}} + \lambda_{\mathrm{adv}} \mathcal{L}_{\mathrm{adv}} + \lambda_{\mathrm{feat}} \mathcal{L}_{\mathrm{feat}}.$$

All components—encoder, quantizer, decoder, LLM, and discriminators—are optimized jointly in an end-to-end fashion.
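The multi-scale structure of the reconstruction term can be sketched in a few lines of NumPy. The code below is a simplification under stated assumptions: it compares plain STFT magnitudes and omits the mel filterbank that the paper's loss applies, and the Hann window and lack of padding are choices made here for brevity. What it keeps from the definition above is the set of seven scales, with window size 2^i and hop size 2^(i-2) for i from 5 to 11, and the L1 distance summed over scales.

```python
import numpy as np

def stft_mag(x, win, hop):
    """Magnitude STFT with a Hann window and no padding (simplified)."""
    window = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop: i * hop + win] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# Window and hop sizes for i = 5, ..., 11.
scales = [(2 ** i, 2 ** (i - 2)) for i in range(5, 12)]

def multiscale_spectral_l1(x, x_hat):
    """Sum over scales of the L1 distance between magnitude spectrograms."""
    total = 0.0
    for win, hop in scales:
        total += np.abs(stft_mag(x, win, hop) - stft_mag(x_hat, win, hop)).sum()
    return float(total)

rng = np.random.default_rng(0)
x = rng.normal(size=4096)                    # toy waveform, longer than the largest window
x_noisy = x + 0.1 * rng.normal(size=4096)

loss_same = multiscale_spectral_l1(x, x)
loss_noisy = multiscale_spectral_l1(x, x_noisy)
print(scales, loss_same, loss_noisy)
```

Short windows resolve transients while long windows resolve pitch and harmonics, which is the usual motivation for summing the distance over several resolutions.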

For end-to-end autoregressive speech generation, the authors build CAT-TTS, which directly predicts CAT’s RVQ tokens from text and speaker prompts. The model employs a Temporal Transformer to capture long-range dependencies across time and a Depth Transformer to model the coarse-to-fine structure within each time step. The Depth Transformer autoregressively predicts RVQ tokens conditioned only on previous time steps and preceding layers at the current step, preserving strict causality.

To enable variable-bitrate synthesis within a single model, the authors introduce Progressive Sequence Dropout. During training, with probability $p$, a random prefix length $K \in \{1, \ldots, N_q - 1\}$ is sampled, and RVQ tokens from layers $K+1$ to $N_q$ are dropped. The effective number of active layers is defined as:

$$\hat{K} = (1 - z)\, N_q + z\, K,$$

where $z \sim \mathrm{Bernoulli}(p)$. The Temporal Transformer receives aggregated embeddings from the first $\hat{K}$ layers at each time step:

$$\tilde{\mathbf{e}}_t = \sum_{k=1}^{\hat{K}} \mathbf{Emb}_k(\mathbf{q}_{t,k}).$$

The training loss is computed only over the retained prefix:

$$\mathcal{L} = -\sum_{t=1}^{T} \sum_{k=1}^{\hat{K}} \log p_\theta\left(\mathbf{q}_{t,k} \mid \mathbf{x}, \mathbf{q}_{<t}, \mathbf{q}_{t,<k}\right).$$
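The three pieces of Progressive Sequence Dropout (sampling the active depth, aggregating embeddings over the retained prefix, and restricting the loss to that prefix) can be sketched together in NumPy. This is an illustrative sketch, not the authors' code: uniform sampling of K is an assumption (the text above only says K is random), and the embedding tables and log-probabilities are random or uniform placeholders standing in for the Temporal and Depth Transformers.

```python
import numpy as np

def sample_active_layers(n_q, p, rng):
    """Return K_hat = (1 - z) * n_q + z * K with z ~ Bernoulli(p)."""
    z = int(rng.random() < p)
    k = int(rng.integers(1, n_q))  # assumed uniform over {1, ..., n_q - 1}
    return (1 - z) * n_q + z * k

def aggregate_embeddings(codes, emb_tables, k_hat):
    """Sum per-layer embeddings of the first k_hat RVQ tokens at each time step.

    codes: (T, n_q) token ids; emb_tables: (n_q, V, D) one table per layer.
    """
    out = np.zeros((codes.shape[0], emb_tables.shape[2]))
    for k in range(k_hat):
        out += emb_tables[k, codes[:, k]]
    return out

def prefix_nll(log_probs, codes, k_hat):
    """Negative log-likelihood over the retained prefix only.

    log_probs: (T, n_q, V) per-position log-probabilities from the model.
    """
    t_idx = np.arange(codes.shape[0])[:, None]
    k_idx = np.arange(k_hat)[None, :]
    return float(-log_probs[t_idx, k_idx, codes[:, :k_hat]].sum())

rng = np.random.default_rng(0)
T, n_q, V, D = 5, 8, 16, 4                      # toy sizes
codes = rng.integers(0, V, size=(T, n_q))
emb_tables = rng.normal(size=(n_q, V, D))
uniform_log_probs = np.full((T, n_q, V), -np.log(V))

k_hat = sample_active_layers(n_q, p=1.0, rng=rng)  # p = 1.0: always truncate
embeddings = aggregate_embeddings(codes, emb_tables, k_hat)
loss = prefix_nll(uniform_log_probs, codes, k_hat)
print(k_hat, embeddings.shape, loss)
```

Because dropped layers contribute neither embeddings nor loss terms, truncated steps process fewer tokens, which is consistent with the reduced training memory reported for this technique.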

At inference, the synthesis bitrate is controlled by selecting an inference depth $K_{\text{infer}}$. The Temporal Transformer processes the first $K_{\text{infer}}$ RVQ streams, and the Depth Transformer predicts only those layers. The resulting tokens are decoded into waveforms using the CAT decoder, which is inherently robust to varying bitrates due to quantizer dropout during training.

As shown in the figure below, the Temporal Transformer processes text and aggregated audio token embeddings over time, while the Depth Transformer autoregressively predicts RVQ tokens across layers, with dropout enabling variable-depth generation.

Experiment

  • MOSS-Audio-Tokenizer excels in speech reconstruction across all bitrates, outperforming prior methods at low bitrates and achieving state-of-the-art results at medium and high bitrates; it maintains competitive performance on general audio and music, with quality improving as bitrate increases.
  • Progressive Sequence Dropout significantly enhances robustness of the TTS system under reduced bitrates, enabling stable performance across varying dropout rates while reducing training memory usage; p=1.0 is adopted for optimal efficiency without quality loss.
  • CAT-TTS surpasses prior discrete autoregressive TTS models in speaker similarity and matches top systems like IndexTTS2 and VoxCPM in word error rate, achieving the highest speaker similarity scores on Seed-TTS-Eval for both English and Chinese, validating its effectiveness for zero-shot generation.
  • End-to-end optimization is critical for CAT’s scalability, enabling continuous improvement in reconstruction quality with increased training, unlike partial optimization which plateaus early due to frozen components.
  • Model parameters and quantization capacity must scale together; larger models benefit at high bitrates but underperform at low bitrates, revealing bitrate as the primary bottleneck—optimal scaling requires synchronized expansion of both.
  • Reconstruction fidelity consistently improves with increased training batch size, showing predictable scaling behavior where higher throughput directly translates to higher quality within the same training steps.
  • Subjective evaluations confirm MOSS-Audio-Tokenizer delivers high perceptual quality across bitrates, outperforming variable-bitrate tokenizers at low bitrates and matching specialized tokenizers at their target bitrates.
  • CAT tokens support effective speech understanding: CAT-ASR achieves competitive WER and CER on English and Chinese benchmarks when fed directly into an LLM, demonstrating strong alignment with text and sufficient linguistic content preservation.

The authors use CAT-TTS, a fully autoregressive discrete token-based system, to achieve state-of-the-art speaker similarity scores on both English and Chinese benchmarks while maintaining low word error rates. Results show that CAT-TTS outperforms prior discrete autoregressive models and matches or exceeds performance of recent cascaded and non-autoregressive systems, demonstrating the effectiveness of its unified discrete interface for high-quality zero-shot speech generation. The system also supports variable bitrate control, enabling flexible synthesis without compromising fidelity.

The authors use MOSS-Audio-Tokenizer to achieve strong reconstruction across speech, sound, and music at multiple bitrates, outperforming prior methods especially at low and medium bitrates while maintaining scalability through end-to-end optimization. Results show that its transformer-based architecture, variable bitrate support, and semantic richness contribute to consistent high-quality reconstruction without relying on pretrained encoders. The model’s design enables robust performance across domains and bitrates, distinguishing it from other tokenizers that either lack end-to-end training or fail to scale effectively.

The authors use CAT tokens as direct inputs to a large language model for automatic speech recognition, achieving competitive word and character error rates on English and Chinese benchmarks. Results show that CAT tokens preserve sufficient linguistic content and align well with text, enabling effective speech understanding without additional alignment or auxiliary supervision.

MOSS-Audio-Tokenizer consistently outperforms or matches other open-source audio tokenizers across speech, general audio, and music benchmarks at low, medium, and high bitrates, with performance improving as bitrate increases. The model demonstrates strong scalability through end-to-end optimization, maintaining high reconstruction fidelity even under variable bitrate conditions. Its design enables robust performance across diverse audio types without requiring separate models for different bitrates or domains.

