HyperAIHyperAI

Command Palette

Search for a command to run...

비트던스: 이진 토큰을 활용한 순차 생성 모델의 확장

Yuang Ai Jiaming Han Shaobin Zhuang Weijia Mao Xuefeng Hu Ziyan Yang Zhenheng Yang Huaibo Huang Xiangyu Yue Hao Chen

초록

우리는 코드북 인덱스 대신 이진 시각 토큰을 예측하는 확장 가능한 자기회귀(AR) 이미지 생성기인 BitDance를 제안한다. 고엔트로피 이진 잠재변수를 사용함으로써, BitDance는 각 토큰이 최대 2²⁵⁶개의 상태를 표현할 수 있게 하여, 컴팩트하면서도 매우 표현력이 풍부한 이산적 표현을 가능하게 한다. 이러한 거대한 토큰 공간에서 표준 분류 방식을 사용한 샘플링은 어렵다. 이를 해결하기 위해 BitDance는 이진 확산 헤드(binary diffusion head)를 도입한다. 소프트맥스를 통해 인덱스를 예측하는 대신, 연속 공간에서의 확산 기법을 활용하여 이진 토큰을 생성한다. 더불어, 병렬적으로 다수의 토큰을 고정확도로 예측하는 새로운 디코딩 방법인 다음 패치 확산(next-patch diffusion)을 제안한다. 이는 추론 속도를 크게 향상시킨다. ImageNet 256×256에서 BitDance는 AR 모델 중 최고 수준의 FID 1.24를 달성하였다. 다음 패치 확산 기법을 사용함으로써, 14억 파라미터를 사용하는 최신 병렬 AR 모델을 능가하면서도 파라미터 수는 5.4배 적은 2.6억(260M), 추론 속도는 8.7배 빠른 성능을 보였다. 텍스트-이미지 생성에 있어서 BitDance는 대규모 다중모달 토큰으로 학습하여 고해상도이며 사진처럼 사실적인 이미지를 효율적으로 생성하며, 강력한 성능과 유리한 스케일링 특성을 보였다. 1024×1024 이미지를 생성할 때 BitDance는 기존 AR 모델 대비 30배 이상의 속도 향상을 달성하였다. 본 연구에서는 AR 기반 모델에 대한 추가 연구를 촉진하기 위해 코드와 모델을 공개한다. 코드와 모델은 다음 링크에서 제공된다: https://github.com/shallowdream204/BitDance.

One-sentence Summary

Researchers from ByteDance, CUHK, and CAS propose BitDance, an AR image generator using binary tokens and diffusion heads for efficient sampling, achieving state-of-the-art FID with 5.4× fewer parameters and 8.7× speedup, excelling in high-resolution text-to-image generation.

Key Contributions

  • BitDance introduces a scalable autoregressive image generator that predicts binary visual tokens with up to 22562^{256}2256 states per token, enabling high-fidelity reconstruction while avoiding codebook collapse and quantization errors common in discrete VQ-based approaches.
  • To efficiently sample from this massive token space, BitDance replaces softmax classification with a binary diffusion head and introduces next-patch diffusion, allowing parallel multi-token prediction that reduces parameter count and accelerates inference by up to 8.7× compared to prior parallel AR models.
  • Evaluated on ImageNet 256×256 and text-to-image tasks, BitDance achieves state-of-the-art FID of 1.24 among AR models and generates 1024×1024 images with over 30× speedup, while using 5.4× fewer parameters than competing 1.4B-parameter models.

Introduction

The authors leverage binary visual tokens—instead of discrete codebook indices—to build BitDance, a highly scalable autoregressive image generator that achieves compact, high-entropy representations capable of encoding up to 2^256 states per token. Prior discrete tokenizers suffer from quantization errors and memory-heavy entropy losses, while continuous token approaches in AR models face error accumulation and drift during long-sequence generation. BitDance overcomes these by introducing a binary diffusion head for token generation and next-patch diffusion for parallel multi-token prediction, enabling 8.7x faster inference than prior parallel AR models while using 5.4x fewer parameters and achieving state-of-the-art FID scores. It also scales efficiently to 1024x1024 text-to-image generation with over 30x speedup, making high-fidelity autoregressive generation practical for large-scale applications.

Method

The authors leverage a novel autoregressive architecture called BitDance, designed to scale token entropy for high-fidelity visual generation while maintaining computational efficiency. The core innovation lies in replacing conventional discrete classification heads with a continuous-space diffusion-based prediction mechanism, enabling precise and scalable sampling of binary visual tokens.

The framework begins with a binary visual tokenizer based on Lookup-Free Quantization (LFQ), which maps continuous latent vectors xRdx \in \mathbb{R}^dxRd to binary tokens via xq=sign(x)x_q = \mathrm{sign}(x)xq=sign(x). To prevent codebook collapse and maximize entropy, a group-wise entropy loss is applied, partitioning the ddd channels into ggg groups for tractable computation. This allows scaling the codebook size up to 22562^{256}2256, achieving reconstruction fidelity comparable to continuous VAEs.

For token prediction, BitDance introduces a binary diffusion head that models the joint distribution of binary tokens in continuous space. Instead of predicting discrete indices, the model treats each ddd-bit token as a point on a ddd-dimensional hypercube. Given a hidden state zRhz \in \mathbb{R}^hzRh, the diffusion head learns to predict the target token xRdx \in \mathbb{R}^dxRd by optimizing a velocity-matching loss:

L(z,x)=Et,x,ϵvθ(xt,t,z)vt2,\mathcal{L}(z, x) = \mathbb{E}_{t, x, \epsilon} \left\| v_{\theta}(x_t, t, z) - v_t \right\|^2,L(z,x)=Et,x,ϵvθ(xt,t,z)vt2,

where xt=tx+(1t)ϵx_t = tx + (1 - t)\epsilonxt=tx+(1t)ϵ and vt=xϵv_t = x - \epsilonvt=xϵ. During inference, the model initializes x0N(0,I)x_0 \sim \mathcal{N}(0, \mathbf{I})x0N(0,I) and integrates the velocity field using an Euler solver over NNN steps, followed by a hard binarization x1=sign(x1)x_1 = \mathrm{sign}(x_1)x1=sign(x1) to project the output back onto the hypercube vertices. This geometric constraint enhances stability and convergence.

Refer to the framework diagram, which illustrates the architecture of BitDance trained on multi-modal tokens. An input image is encoded into binary latents, flattened into a 1D sequence via patch-wise raster scan, and processed by an autoregressive transformer. The binary diffusion head enables efficient parallel prediction of multiple tokens per step.

To accelerate generation, BitDance employs a next-patch diffusion paradigm. Rather than predicting tokens one by one, the model partitions the sequence into MMM patches of p×pp \times pp×p tokens and predicts each patch jointly. This is implemented via a block-wise causal attention mask that allows intra-patch tokens to attend to each other while preserving autoregressive dependencies across patches. To support this, p21p^2 - 1p21 learnable prefix tokens are prepended to the visual sequence.

The binary diffusion head is extended to support multi-token prediction by adapting the velocity-matching loss to a patch of tokens XRp2×dX \in \mathbb{R}^{p^2 \times d}XRp2×d conditioned on hidden states ZRp2×hZ \in \mathbb{R}^{p^2 \times h}ZRp2×h:

Lparallel=Et,X,ϵvθ(Xt,t,Z)vt2.\mathcal{L}_{parallel} = \mathbb{E}_{t, X, \epsilon} \left\| v_{\theta}(X_t, t, Z) - v_t \right\|^2.Lparallel=Et,X,ϵvθ(Xt,t,Z)vt2.

The prediction network fθf_\thetafθ is implemented as a lightweight DiT to efficiently model the joint distribution of p2p^2p2 tokens. This design bridges the training-inference gap present in prior parallel AR models, which rely on independent token sampling despite modeling joint distributions.

As shown in the figure below, the binary diffusion head enables joint token sampling, in contrast to parallel classification heads that sample tokens independently, violating the required joint distribution.

For text-to-image generation, BitDance is built atop a pretrained LLM (Qwen3-14B), serving as both text encoder and image generator. The model is trained using a three-stage pipeline: pre-training, continued training, and supervised fine-tuning, with mixed-resolution training (256px, 512px, 1024px) to ensure stability and generalization. A distillation stage further increases parallelism from 16 to 64 tokens per step. The training loss combines vision loss from the diffusion head and text loss (cross-entropy) with a 1:0.01 weighting, and classifier-free guidance is enabled via 10% token dropout.

The binary diffusion head’s effectiveness stems from the geometric structure of binary tokens as hypercube vertices. This finite, orientation-constrained space provides strong regularization, reducing optimization complexity and improving sampling stability compared to unconstrained continuous latent spaces. As illustrated in the comparison of binary token sampling paradigms, the diffusion head achieves both scaling efficiency and precise sampling, overcoming the limitations of index-based and bit-wise classification approaches.

Experiment

  • Scaling token entropy improves discrete tokenizer reconstruction, matching or exceeding continuous models, especially with larger codebooks and downsampling ratios.
  • Larger autoregressive Transformers better exploit increased vocabulary sizes, showing that model scale must grow alongside token entropy for optimal generative performance.
  • BitDance achieves state-of-the-art or near-state-of-the-art results on ImageNet and text-to-image benchmarks, outperforming larger parallel AR models while using fewer parameters and less training data.
  • The binary diffusion head enables efficient parallel token generation, maintains high quality with few sampling steps, and implicitly learns discrete distributions without explicit constraints.
  • Ablations confirm the superiority of binary tokenization over continuous VAEs, the effectiveness of patch-wise raster scan and block causal masks, and the limitations of alternative sampling heads.
  • BitDance demonstrates strong data efficiency, matching proprietary models despite training on orders of magnitude less data, and excels in prompt following, text fidelity, and inference speed.

Results show that BitDance achieves state-of-the-art performance among autoregressive models on text-to-image generation benchmarks, matching or exceeding many proprietary and diffusion-based systems despite being trained on significantly less data. The model demonstrates strong capabilities in prompt following, text rendering, and reasoning across multiple evaluation dimensions. Its efficiency and scalability allow it to close the performance gap between open-source and commercial models while maintaining faster inference speeds.

The authors evaluate different sampling heads for autoregressive image generation and find that the binary diffusion head significantly outperforms alternatives in both FID and Inception Score. While token classification fails due to memory constraints and bitwise classification underperforms from oversimplified assumptions, the binary diffusion head achieves superior quality by effectively modeling discrete token distributions. Results confirm that this design enables robust, high-fidelity generation with minimal sampling steps.

The authors evaluate two variants of their text-to-image model—SFT and Distilled—on benchmark metrics, showing that the Distilled version, which generates 64 tokens in parallel, achieves a slightly higher DPG-Bench score while maintaining comparable GenEval performance. This indicates that distillation preserves generation quality while enabling faster inference through increased parallelism.

Results show that BitDance achieves state-of-the-art performance among autoregressive models on text-to-image generation benchmarks, matching or exceeding leading proprietary and diffusion models in key capabilities like prompt alignment and text rendering. The model attains these results despite being trained on significantly less data, highlighting its efficiency and strong generalization. Its performance across multiple evaluation dimensions underscores the effectiveness of its architecture in bridging the gap between open-source and commercial systems.

The authors evaluate BitDance against other image generation models and find it achieves significantly lower latency despite matching the parameter scale of larger autoregressive models. Results show BitDance generates 1024×1024 images faster than both diffusion and autoregressive baselines, highlighting its efficiency in inference speed. This performance is achieved without compromising generative quality, as demonstrated in prior benchmarks.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp