PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation

Huadai Liu Kaicheng Luo Wen Wang Qian Chen Peiwen Sun Rongjie Huang Xiangang Li Jieping Ye Wei Xue

Abstract

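Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy. Existing methods, however, suffer from objective entanglement, in which competing goals are conflated within a single loss function, and they lack human preference alignment. We introduce PrismAudio, the first framework to integrate Reinforcement Learning (RL) into V2A generation with specialized Chain-of-Thought (CoT) planning. Our approach decomposes monolithic reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial), each paired with a targeted reward function. This CoT-reward correspondence enables multi-dimensional RL optimization that guides the model to jointly generate better reasoning across all perspectives, resolving objective entanglement while preserving interpretability. To make this optimization computationally practical, we propose Fast-GRPO, which uses hybrid ODE-SDE sampling to dramatically reduce training overhead compared to existing GRPO implementations. We also introduce AudioCanvas, a rigorous benchmark that is more distributionally balanced than existing datasets, covers more realistically diverse and challenging scenarios, and includes 300 single-event classes and 501 multi-event samples. Experimental results show that PrismAudio achieves state-of-the-art performance across all four perceptual dimensions on both the in-domain VGGSound test set and the out-of-domain AudioCanvas benchmark.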

One-sentence Summary

Researchers from HKUST, Alibaba Group, and CUHK introduce PrismAudio, the first framework to integrate reinforcement learning into video-to-audio generation via specialized chain-of-thought planning that decomposes reasoning into semantic, temporal, aesthetic, and spatial modules paired with targeted rewards to resolve objective entanglement while preserving interpretability, alongside Fast-GRPO for reduced training overhead and the AudioCanvas benchmark with 300 single-event classes and 501 multi-event samples.

Key Contributions

  • We introduce PrismAudio, the first framework to integrate Reinforcement Learning into video-to-audio generation with specialized Chain-of-Thought planning. This approach decomposes reasoning into four specialized CoT modules paired with targeted reward functions to address objective entanglement while preserving interpretability.
  • To ensure computational practicality, we propose Fast-GRPO, an optimization method employing hybrid ODE-SDE sampling. This technique dramatically reduces training overhead compared to existing GRPO implementations.
  • We also introduce AudioCanvas, a rigorous benchmark that is more distributionally balanced and covers more realistically diverse and challenging scenarios than existing datasets. It includes 300 single-event classes and 501 multi-event samples to support evaluation.

Introduction

Video-to-Audio generation requires balancing semantic consistency, temporal synchrony, aesthetic quality, and spatial accuracy to synthesize soundscapes from silent videos. Existing methods suffer from objective entanglement, where competing goals are conflated in a single loss function, and they lack human preference alignment. Recent Chain-of-Thought approaches also fall short because their monolithic planning cannot address distinct perceptual dimensions independently. The authors introduce PrismAudio, which integrates Reinforcement Learning with specialized Chain-of-Thought modules for each perceptual axis. This decomposition enables multi-dimensional RL optimization to guide reasoning across all perspectives while preserving interpretability. To reduce training overhead, they propose Fast-GRPO using hybrid ODE-SDE sampling. Additionally, the team presents AudioCanvas, a rigorous benchmark for diverse scenarios, and reports state-of-the-art performance across all four dimensions.

Dataset

Dataset overview
  • The authors construct AudioCanvas using 300 distinct categories from the AudioSet ontology, focusing on sound effects and music while excluding human speech and singing.
  • The final benchmark comprises 3,177 high-quality videos, including a curated subset of 501 multi-event videos designed to evaluate complex scene interactions.
  • Filtering protocols automatically discard samples where existing V2A models achieve near-perfect scores, while professional experts manually screen for diversity and audio-visual correlation.
  • Gemini 2.5 Pro generates structured Chain-of-Thought captions covering semantic, temporal, aesthetic, and spatial dimensions, which are subsequently decoupled into separate modules by a text LLM.
  • The dataset supports both advanced benchmarking and fine-tuning for models like VideoLLaMA2, with access restricted to academic researchers via a formal application process.
  • Privacy and ethical standards are maintained by providing reference links instead of raw video redistribution and using anonymized identifiers for all content.

Method

The proposed method, PrismAudio, operates through a three-stage pipeline comprising a CoT-aware audio foundation model, customized CoT modules for reasoning decomposition, and a GRPO post-training framework. The overall architecture integrates these components to enable high-quality video-to-audio generation with multi-dimensional reasoning capabilities.

Overview of the PrismAudio framework, illustrating the CoT data construction on the left and the Fast-GRPO multi-dimensional CoT-RL training pipeline on the right.

CoT-Aware Audio Foundation Model

The core generation model is built upon a diffusion transformer backbone utilizing flow matching. To overcome the limitations of existing models in handling complex video scenarios and structured reasoning text, the authors implement two key architectural enhancements. First, they replace standard CLIP-based encoders with VideoPrism, a state-of-the-art video encoder designed to capture rich semantic representations of objects, actions, and environmental contexts. Second, to effectively condition the model on the structured reasoning patterns required for multi-dimensional CoT, the standard T5 encoder is upgraded to T5-Gemma. This encoder-decoder architecture adapts the reasoning capabilities of decoder-only LLMs, enabling robust comprehension of the analytical text generated by the CoT modules.

Decomposing Multi-dimensional CoT Reasoning

To address the limitations of monolithic reasoning paths, the method decomposes video-to-audio reasoning into four specialized dimensions: Semantic, Temporal, Aesthetic, and Spatial. The Semantic CoT identifies audio events and characteristics, while the Temporal CoT determines the sequential ordering of these events. The Aesthetic CoT focuses on quality aspects like naturalness and fidelity, and the Spatial CoT analyzes sound positioning and distance. High-quality training data for these dimensions is constructed using Gemini 2.5 Pro. This data is then used to fine-tune VideoLLaMA2, enabling it to generate the four specialized CoTs. These distinct reasoning texts are concatenated to form a multi-dimensional CoT, which serves as enhanced structured text conditioning for the audio foundation model.
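The concatenation step described above can be illustrated with a minimal Python sketch. The class name `MultiDimCoT`, the bracketed section labels, the fixed ordering, and the example sentences are all hypothetical choices made for illustration; the summary does not specify how the four texts are actually joined.

```python
from dataclasses import dataclass


@dataclass
class MultiDimCoT:
    """Four specialized reasoning texts, one per perceptual dimension."""
    semantic: str
    temporal: str
    aesthetic: str
    spatial: str

    def to_conditioning_text(self) -> str:
        # Hypothetical layout: one labeled line per dimension, in a fixed order.
        sections = [
            ("Semantic", self.semantic),
            ("Temporal", self.temporal),
            ("Aesthetic", self.aesthetic),
            ("Spatial", self.spatial),
        ]
        return "\n".join(f"[{name}] {text.strip()}" for name, text in sections)


# Made-up example texts; in the described pipeline these would come from
# the fine-tuned VideoLLaMA2 model.
cot = MultiDimCoT(
    semantic="A dog barks twice, then a door slams.",
    temporal="The barks come first; the slam follows about a second later.",
    aesthetic="Clean, natural recording with no clipping.",
    spatial="The dog is close and left of center; the door is farther right.",
)
conditioning_text = cot.to_conditioning_text()
print(conditioning_text)
```

Keeping the four texts as separate fields until the last moment makes it straightforward to inspect or replace a single dimension, which fits the interpretability argument made for the decomposition.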

Fast-GRPO Post-Training Framework

The final stage aligns the audio foundation model with multi-dimensional human preferences using a Fast-GRPO framework. This process involves four specialized reward functions corresponding to the CoT dimensions: Semantic Reward (measured by MS-CLAP), Temporal Reward (assessed via Synchformer), Aesthetic Reward (using Meta Audiobox Aesthetics), and Spatial Reward (employing StereoCRW). To optimize efficiently across these objectives, the authors introduce a mixed sampler with random-window scheduling. While the flow matching generation is inherently deterministic (an ODE), it is reformulated as a stochastic process (an SDE) to enable RL-based optimization. The Fast-GRPO algorithm strategically confines stochasticity to a small, randomly placed window of timesteps within the generation trajectory. For each training iteration, a starting position $\ell$ is sampled to define an optimization window $\mathcal{W}(\ell)$ with width $w \ll T$:

$$
\mathcal{W}(\ell) = \{\ell,\ \ell + 1,\ \ldots,\ \ell + w - 1\}.
$$
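As a small illustration of the window definition above, the following Python sketch samples a start position and returns the corresponding set of step indices. The function name `sample_window`, the zero-based indexing, and the uniform distribution over valid start positions are assumptions; the summary only states that the window is randomly placed.

```python
import random


def sample_window(num_steps: int, width: int, rng: random.Random) -> range:
    """Return W(l) = {l, l+1, ..., l+w-1} as a range inside [0, num_steps)."""
    if not 0 < width <= num_steps:
        raise ValueError("width must satisfy 0 < width <= num_steps")
    # Assumption: l is uniform over every start that keeps the window in range.
    start = rng.randint(0, num_steps - width)
    return range(start, start + width)


rng = random.Random(0)
window = sample_window(num_steps=24, width=4, rng=rng)
print(list(window))
```

Returning a `range` keeps membership checks such as `t in window` constant-time, which is convenient when a sampler has to decide at every step whether it is inside the window.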

The generation process interleaves deterministic ODE steps and stochastic SDE steps based on this window. For a step size $\Delta t$, the update rule is:

$$
\mathbf{x}_{t+1} =
\begin{cases}
\mathbf{x}_t + v_\theta(\mathbf{x}_t, t, c)\,\Delta t, & \text{if } t \notin \mathcal{W}(\ell) \quad \text{(ODE step)} \\
\mathbf{x}_t + \mu_{\mathrm{SDE}}(\mathbf{x}_t, t, c)\,\Delta t + \sigma_t \sqrt{\Delta t}\,\varepsilon_t, & \text{if } t \in \mathcal{W}(\ell) \quad \text{(SDE step)}
\end{cases}
$$

where $\varepsilon_t \sim \mathcal{N}(0, I)$ and $v_\theta$ is the model's predicted velocity. This hybrid approach allows for tractable policy ratio computation and reduces the Number of Function Evaluations (NFE) per sample, enabling near-linear complexity training. The policy model is optimized by maximizing the following objective, derived from the Fast-GRPO formulation restricted to the selected SDE steps:

$$
\mathcal{J}_{\text{Fast-GRPO}}(\theta) = \mathbb{E}_{c,\,\ell,\,\{\mathbf{x}^i\} \sim \pi_{\theta_{\mathrm{old}}}} \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{1}{w} \sum_{t \in \mathcal{W}(\ell)} \min\Big( r_t^i(\theta)\,A^i,\ \mathrm{clip}\big(r_t^i(\theta),\, 1 - \varepsilon,\, 1 + \varepsilon\big)\,A^i \Big) \right].
$$

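where $r_t^i(\theta)$ is the policy ratio between the current and old policies for sample $i$ at step $t$, $\varepsilon$ is the clipping range (distinct from the noise term $\varepsilon_t$), and $A^i$ is the group-normalized advantage computed from the weighted sum of the multi-dimensional rewards.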
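The hybrid update rule and the clipped objective above can be sketched together in plain Python. This is a toy illustration under several assumptions, not the authors' implementation: the drift `sde_drift` and noise scale `sigma` are passed in as callables because the summary does not define $\mu_{\mathrm{SDE}}$ or $\sigma_t$, `toy_velocity` stands in for the learned $v_\theta$, the reward values and equal weights are made up, the population standard deviation is one possible choice for group normalization, and the log-ratios are supplied directly rather than computed from the SDE transition densities.

```python
import math
import random
from statistics import mean, pstdev


def hybrid_rollout(x0, num_steps, window, velocity, sde_drift, sigma, rng):
    """Euler rollout with SDE steps inside `window` and ODE steps elsewhere."""
    dt = 1.0 / num_steps
    x, sde_steps = list(x0), []
    for t in range(num_steps):
        if t in window:
            # SDE step: x + mu_SDE * dt + sigma_t * sqrt(dt) * eps, eps ~ N(0, I).
            drift = sde_drift(x, t)
            scale = sigma(t) * math.sqrt(dt)
            x = [xi + di * dt + scale * rng.gauss(0.0, 1.0) for xi, di in zip(x, drift)]
            sde_steps.append(t)
        else:
            # ODE step: x + v_theta * dt (deterministic).
            x = [xi + vi * dt for xi, vi in zip(x, velocity(x, t))]
    return x, sde_steps


def group_advantages(rewards, weights, eps=1e-8):
    """Weighted sum of per-dimension rewards, normalized within the group."""
    totals = [sum(weights[k] * r[k] for k in weights) for r in rewards]
    mu, sd = mean(totals), pstdev(totals)
    return [(s - mu) / (sd + eps) for s in totals]


def fast_grpo_objective(log_ratios, advantages, clip_eps=0.2):
    """Clipped surrogate averaged over N samples and the w window steps.

    log_ratios[i][j] is log(pi_theta / pi_theta_old) for sample i at the
    j-th SDE step of the window.
    """
    per_sample = []
    for step_log_ratios, adv in zip(log_ratios, advantages):
        terms = []
        for log_r in step_log_ratios:
            ratio = math.exp(log_r)
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            terms.append(min(ratio * adv, clipped * adv))
        per_sample.append(mean(terms))
    return mean(per_sample)


def toy_velocity(x, t):
    # Hypothetical stand-in for v_theta: pulls the state toward the origin.
    return [-xi for xi in x]


rng = random.Random(0)
window = range(3, 5)  # W(l) with l = 3 and w = 2 out of T = 10 steps
final_state, sde_steps = hybrid_rollout(
    [1.0, -2.0], 10, window, toy_velocity, toy_velocity, lambda t: 0.5, rng
)

# Made-up rewards for a group of three samples; equal weights are an assumption.
weights = {"semantic": 1.0, "temporal": 1.0, "aesthetic": 1.0, "spatial": 1.0}
rewards = [
    {"semantic": 0.6, "temporal": 0.4, "aesthetic": 0.7, "spatial": 0.5},
    {"semantic": 0.8, "temporal": 0.6, "aesthetic": 0.6, "spatial": 0.6},
    {"semantic": 0.3, "temporal": 0.2, "aesthetic": 0.5, "spatial": 0.4},
]
advantages = group_advantages(rewards, weights)
on_policy = [[0.0] * len(window) for _ in rewards]  # ratio = 1 at every step
objective = fast_grpo_objective(on_policy, advantages)
print(sde_steps, advantages, objective)
```

Only the steps recorded in `sde_steps` contribute terms to the objective, which is the sense in which the optimization is restricted to the window. With on-policy log-ratios of zero the surrogate reduces to the mean advantage, which is approximately zero after group normalization.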

Experiment

The study evaluates performance on the VGGSound test set and the newly introduced AudioCanvas benchmark using comprehensive objective metrics and subjective Mean Opinion Scores to assess semantic, temporal, spatial, and aesthetic dimensions. Results demonstrate that PrismAudio achieves state-of-the-art performance by employing a multi-dimensional Chain-of-Thought reinforcement learning framework that effectively balances competing perceptual objectives, while ablation analyses confirm that structured decomposed reasoning is essential for maintaining robustness in complex scenarios. Qualitative comparisons further highlight the model's superior ability to preserve high-frequency details and accurate transient responses compared to existing methods.

The table compares various Chain-of-Thought reasoning strategies, ranging from a baseline without reasoning to the proposed multi-dimensional approach. Results demonstrate that structured, decomposed reasoning significantly improves performance across semantic, temporal, and aesthetic metrics compared to unstructured or single-block methods. The MultiCoT method achieves the best overall scores, validating the necessity of logical planning for high-quality generation.
  • MultiCoT achieves the highest semantic alignment and temporal synchrony scores.
  • Structured reasoning strategies outperform random or unstructured approaches.
  • Decomposed reasoning yields better aesthetic quality than monolithic methods.

Multi-dimensional CoT reasoning outperforms baselines

The authors evaluate the proposed method against competitive baselines on the VGGSound test set. Results show that PrismAudio achieves superior performance across semantic, temporal, and aesthetic dimensions compared to prior models. The ablation study further indicates that the CoT-RL framework provides substantial improvements over the foundation model.
  • PrismAudio secures the highest subjective scores for audio quality and consistency.
  • The proposed method outperforms baselines in spatial accuracy and temporal synchrony.
  • CoT-RL optimization delivers significant performance gains over the base foundation model.

PrismAudio demonstrates superior performance across all evaluation metrics.

The experiment evaluates the impact of different Chain-of-Thought reasoning structures on audio generation quality. Results demonstrate that structured, decomposed reasoning significantly outperforms unstructured or monolithic approaches across semantic, temporal, and aesthetic dimensions.
  • MultiCoT outperforms monolithic CoT in semantic understanding and aesthetic quality.
  • Baseline models without CoT reasoning perform poorly across all evaluation metrics.
  • Structured logical plans are essential for high-quality generation compared to random keyword ordering.

Comparative analysis of CoT reasoning strategies

The authors evaluate video encoders on a retrieval task using the AudioCanvas benchmark. VideoPrism achieves significantly higher recall scores compared to CLIP and X-CLIP across all scene complexities. The performance gap is particularly large in multi-event scenarios, highlighting VideoPrism's robustness.
  • VideoPrism achieves the highest recall scores across all scene categories.
  • The performance advantage is most significant in complex multi-event scenes.
  • VideoPrism maintains robust retrieval accuracy while baselines degrade.

VideoPrism demonstrates superior scene understanding capabilities

The authors compare text encoders to validate their ability to handle structured reasoning. T5-Gemma consistently achieves better results than T5-Base and T5-Large across sequential understanding and causal logic metrics. These findings support the selection of instruction-tuned models for processing complex Chain-of-Thought descriptions.
  • T5-Gemma achieves the highest scores in sequential understanding tasks.
  • Causal reasoning capabilities are significantly stronger with T5-Gemma.
  • Multi-step reasoning accuracy remains high for T5-Gemma compared to baselines.

T5-Gemma outperforms standard T5 models in reasoning tasks.

The experiments validate the proposed framework by comparing it against baselines across audio generation, video retrieval, and text encoding tasks. Results indicate that structured, multi-dimensional Chain-of-Thought reasoning and CoT-RL optimization significantly enhance semantic alignment and temporal synchrony compared to unstructured approaches. Additionally, components like VideoPrism and T5-Gemma demonstrate superior robustness in complex scene retrieval and causal logic, confirming the necessity of instruction-tuned models for high-quality generation.

