OmniForcing: Unleashing Real-time Joint Audio-Visual Generation

Yaofeng Su Yuming Li Zeyue Xue Jie Huang Siming Fu Haoran Li Ying Li Zezhong Qian Haoyang Huang Nan Duan

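Abstract

Recent joint audio-visual diffusion models achieve remarkable generation quality, but the high latency caused by their dependence on bidirectional attention limits their use in real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such a dual-stream architecture causes severe training instability, owing to the extreme temporal asymmetry between modalities and the token sparsity that results from it. To address this inherent information-density gap, we introduce Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. To resolve the gradient explosion caused by extreme audio token sparsity during the shift to causal attention, we further design an Audio Sink Token mechanism equipped with an Identity RoPE (rotary position embedding) constraint. Finally, a Joint Self-Forcing Distillation paradigm lets the model dynamically self-correct the cross-modal errors that accumulate from exposure bias during long rollouts. Powered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves streaming generation at approximately 25 FPS on a single GPU while maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher model. Project page: https://omniforcing.com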

One-sentence Summary

Researchers from JD Explore Academy, Fudan University, Peking University, and the University of Hong Kong propose OmniForcing, a framework that distills bidirectional audio-visual diffusion models into real-time streaming generators. By introducing asymmetric block-causal alignment and audio sink tokens, it overcomes training instability to achieve 25 FPS generation while preserving high-fidelity synchronization.

Key Contributions

  • OmniForcing addresses the high latency of bidirectional joint audio-visual diffusion models by distilling them into a high-fidelity streaming autoregressive generator that enables real-time applications.
  • The framework introduces Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix and an Audio Sink Token mechanism to resolve training instability caused by extreme temporal asymmetry and token sparsity.
  • By employing a Joint Self-Forcing Distillation paradigm and a Modality-Independent Rolling KV-Cache, the method achieves state-of-the-art streaming generation at approximately 25 FPS on a single GPU while maintaining synchronization and visual quality comparable to the teacher model.

Introduction

Recent joint audio-visual diffusion models such as LTX-2 deliver high-fidelity synchronized content, but they rely on bidirectional attention, which requires processing the entire timeline at once. This design creates prohibitive latency and prevents real-time streaming. Existing workarounds either decouple the modalities, which degrades quality, or become unstable when applied to dual-stream systems because of extreme token sparsity and temporal asymmetry. The authors introduce OmniForcing, the first framework to distill an offline bidirectional model into a high-fidelity streaming autoregressive generator. They resolve the training instability through Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix and an Audio Sink Token mechanism equipped with an Identity RoPE constraint. In addition, a Joint Self-Forcing Distillation paradigm allows the model to self-correct cumulative errors, enabling state-of-the-art streaming generation at approximately 25 FPS on a single GPU.

Method

The authors use a dual-stream transformer backbone to enable real-time, streaming joint generation of temporally aligned video and audio. The overall OmniForcing pipeline restructures a pretrained bidirectional model into a block-causal autoregressive system. It does so through a three-stage distillation process designed to transfer the teacher's high-fidelity joint distribution to a fast causal student.

The training process follows a sequential distillation paradigm to smoothly decouple few-step denoising from the causal generation paradigm. Stage I employs Bidirectional Distribution Matching Distillation (DMD) to adapt the model for few-step denoising while preserving the global receptive field. Stage II utilizes causal ODE regression to adapt the network weights to the asymmetric block-causal mask, correcting the conditional distribution shift. Finally, Stage III implements joint Self-Forcing training by autoregressively unrolling the generation process to mitigate exposure bias and ensure cross-modal synchrony.

To address the extreme frequency asymmetry between video (3 FPS) and audio (25 FPS) latents, the method employs an Asymmetric Block-Causal Masking design. This approach bridges the information density gap by establishing a physical-time-based Macro-block Alignment. As shown in the figure below, the timeline is partitioned into 1-second macro-blocks, where each block encapsulates a fixed number of video and audio latents without fractional remainders.
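
The macro-block idea can be illustrated with a small sketch. It is not the authors' implementation: it assumes that every 1-second macro-block holds exactly 3 video and 25 audio latents (the rates quoted above), that attention is unrestricted inside a macro-block, and that all earlier macro-blocks stay visible. The helper names `block_ids` and `block_causal_mask` are hypothetical, and the sketch uses uniform blocks only.

```python
from typing import List

VIDEO_PER_BLOCK = 3   # video latents per 1-second macro-block (rate quoted in the text)
AUDIO_PER_BLOCK = 25  # audio latents per 1-second macro-block (rate quoted in the text)


def block_ids(num_blocks: int, per_block: int) -> List[int]:
    """Return the macro-block index of every latent in one modality stream."""
    return [b for b in range(num_blocks) for _ in range(per_block)]


def block_causal_mask(query_blocks: List[int], key_blocks: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when query i may attend to key j.

    Assumption of this sketch: a query sees every key that lies in its own
    macro-block or in an earlier one, regardless of modality.
    """
    return [[kb <= qb for kb in key_blocks] for qb in query_blocks]


num_blocks = 3
video_blocks = block_ids(num_blocks, VIDEO_PER_BLOCK)  # 9 video latents
audio_blocks = block_ids(num_blocks, AUDIO_PER_BLOCK)  # 75 audio latents

# Cross-modal mask: video queries attending to audio keys.
v2a = block_causal_mask(video_blocks, audio_blocks)
# Audio self-attention mask.
a2a = block_causal_mask(audio_blocks, audio_blocks)

# Number of audio latents visible to the first video latent of each block.
visible_per_block = [sum(v2a[b * VIDEO_PER_BLOCK]) for b in range(num_blocks)]
print(visible_per_block)
```

Because both streams are indexed by the same physical-time block, the two modalities share block boundaries even though their token counts per block differ.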

The initial components are merged into a Global Prefix block ($\mathcal{B}_0$), which functions as a system prompt and remains globally visible to all future tokens. To prevent the gradient explosions and Softmax collapse caused by the sparse history in early audio blocks, the authors introduce learnable Sink Tokens prepended to the audio sequence. These tokens are anchored within the global prefix and use an Identity RoPE constraint to remain position-agnostic. During inference, the architecture supports asymmetric compute allocation and parallel inference through modality-independent rolling KV caches, enabling real-time synchronized generation.
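
A rough sketch of how a modality-independent rolling cache with a pinned prefix might be organized follows. Everything in it is an assumption made for illustration: the class name `RollingKVCache`, the window of two blocks, the number of sink tokens, and the rule of evicting the oldest block first are all hypothetical, and cache entries are plain strings standing in for key/value tensors.

```python
from collections import deque
from typing import Deque, List


class RollingKVCache:
    """Per-modality cache: pinned prefix entries plus a sliding window of blocks.

    Entries are opaque placeholders; a real cache would hold key/value tensors.
    """

    def __init__(self, window_blocks: int) -> None:
        self.prefix: List[str] = []  # never evicted (global prefix, sink tokens)
        self.window: Deque[List[str]] = deque(maxlen=window_blocks)

    def set_prefix(self, entries: List[str]) -> None:
        self.prefix = list(entries)

    def append_block(self, entries: List[str]) -> None:
        # deque(maxlen=...) drops the oldest block automatically when full.
        self.window.append(list(entries))

    def visible(self) -> List[str]:
        """Entries that a newly generated block may attend to."""
        return self.prefix + [e for block in self.window for e in block]


# One independent cache per modality (window size of 2 is an arbitrary choice).
video_cache = RollingKVCache(window_blocks=2)
audio_cache = RollingKVCache(window_blocks=2)
video_cache.set_prefix(["v_prefix"])
audio_cache.set_prefix(["sink0", "sink1", "a_prefix"])  # two sink tokens, hypothetical

# Stream four 1-second macro-blocks: 3 video and 25 audio latents each.
for t in range(1, 5):
    video_cache.append_block([f"v{t}_{i}" for i in range(3)])
    audio_cache.append_block([f"a{t}_{i}" for i in range(25)])

print(len(video_cache.visible()), len(audio_cache.visible()))
```

Keeping the prefix outside the rolling window means the sink tokens and the global prefix stay attendable however long the rollout runs, while the memory used by each stream stays bounded.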

Experiment

  • OmniForcing is evaluated against bidirectional and cascaded autoregressive baselines to validate its ability to achieve high-fidelity streaming audio-visual generation with real-time efficiency.
  • The method demonstrates a significant speedup over offline teacher models, enabling true streaming playback with low latency while maintaining visual and audio quality comparable to the strongest joint models.
  • Qualitative analysis confirms the model successfully generates layered sounds, synchronized speech, and complex audio blends that align precisely with visual events.
  • Ablation studies validate that Audio Sink Tokens combined with Identity RoPE are essential for stabilizing training under causal constraints, whereas alternative stabilization methods lead to convergence failures or degraded output quality.
  • Overall, the experiments confirm that OmniForcing achieves a massive reduction in inference time while preserving the perceptual fidelity and cross-modal coherence of the original bidirectional teacher.
