HyperAIHyperAI

Command Palette

Search for a command to run...

Stream-R1: 스트리밍 영상 생성을 위한 신뢰도-퍼플렉시티 기반 보상 디스틸레이션

Bin Wu Mengqi Huang Shaojin Wu Weinan Jia Yuxin Wang Zhendong Mao Yongdong Zhang

초록

디스틸레이션(distillation) 기반 가속은 자기회귀(autoregressive) 스트리밍 비디오 디퓨전 모델을 실용적으로 만드는 데 필수적인 기반이 되었으며, 그중에서도 분포 일치 디스틸레이션(Distribution Matching Distillation, DMD)이 사실상 표준으로 자리 잡았습니다. 그러나 기존 방법들은 학생 모델을教師의 출력에 무차별적으로 맞추도록 훈련시키며, 모든 로ール아웃(rollout), 프레임, 픽셀을 신뢰할 수 있는 동일한 교사 신호로 취급합니다. 본 논문은 이러한 접근 방식이 디스틸레이션의 품질 한계를 초래한다고 주장합니다. 이는 DMD 교사 신호에서의 두 가지 보편적 분산 축을 간과하기 때문입니다. 첫째는 'Inter-Reliability'(교차 신뢰도)로, 교사 신호의 신뢰도 수준이 상이한 학생 로ール아웃 간의 변이를 의미합니다. 둘째는 'Intra-Perplexity'(내부 복잡도)로, 공간적 영역과 시간적 프레임 중에서 아직 품질 향상이 가능한 곳에 불균등하게 기여하는 변이를 의미합니다. 기존의 목적 함수는 단일 가중치 아래 '각 로ール아웃에서 학습할 것인가'와 '각 로ール아웃 내에서 최적화를 어디에 집중할 것인가'라는 두 가지 질문을 혼동합니다.이에 대응하여 우리는 Stream-R1을 제안합니다. 이는 신뢰도-복잡도 인지 Reward Distillation 프레임워크로, 단일 공유 보상(reward) 기반 메커니즘을 통해 로ール아웃 수준과 시공간 요소 수준 모두에서 디스틸레이션 목적 함수의 가중치를 적응적으로 재조정합니다. Inter-Reliability 수준에서 Stream-R1은 사전 훈련된 비디오 보상 점수의 지수 함수를 사용하여 각 로ール아웃의 손실(loss)을 재스케일링하며, 이를 통해 교사 신호가 신뢰할 수 있는 로ール아웃들이 최적화 과정에서 지배적인 역할을 수행하도록 합니다. Intra-Perplexity 수준에서는 동일한 보상 모델을 역전파(back-propagation)하여 픽셀 단위 그라디언트 영합도(saliency)를 추출하고, 이를 공간 및 시간 가중치에 반영하여 최적화 압력이 예상 개선 효과가 가장 큰 영역과 프레임에 집중되도록 합니다. 적응형 균형 메커니즘은 시각적 품질, 모션 품질, 텍스트 정렬(text alignment) 등 모든 차원에서 특정 품질 축이 지배하지 않도록 방지합니다. Stream-R1은 아키텍처 수정이나 추가 추론 비용 증가 없이, 표준 스트리밍 비디오 생성 벤치마크에서 디스틸레이션 기반 접근법 대비 세 가지 차원 전반에 걸쳐 일관된 개선을 달성합니다.

One-sentence Summary

The authors propose Stream-R1, a Reliability-Perplexity Aware Reward Distillation framework for autoregressive streaming video diffusion models that employs a single shared reward-guided mechanism to adaptively reweight the distillation objective at both rollout and spatiotemporal-element levels, addressing inter-reliability and intra-perplexity variances overlooked by existing indiscriminative distribution matching distillation methods.

Key Contributions

  • The paper introduces Stream-R1, a Reliability-Perplexity Aware Reward Distillation framework that brings reward signals directly into the distribution matching distillation objective. This method adaptively reweights the distillation objective through a single shared reward-guided mechanism to address indiscriminative supervision.
  • The approach applies an Inter-Reliability scalar weight to modulate each rollout's contribution to the loss and an Intra-Perplexity per-element weight to concentrate optimization on regions with the largest expected gain. This design addresses the conflation of deciding whether to learn from each rollout and where to concentrate optimization within each rollout.
  • Results demonstrate consistent improvements on all three quality dimensions over DMD-based baselines on standard streaming video generation benchmarks. These gains are achieved without any architectural modification to the student model and at no additional inference cost.

Introduction

Autoregressive streaming video diffusion models enable unbounded generation but require distillation to mitigate prohibitive inference costs. Current Distribution Matching Distillation methods apply uniform supervision across all rollouts and pixels, overlooking variance in supervision reliability and refinement potential. The authors propose Stream-R1, a framework that leverages a pretrained reward model to adaptively reweight the distillation objective at both the rollout and spatiotemporal levels. This approach addresses Inter-Reliability by scaling loss based on reward scores and Intra-Perplexity by using gradient saliency to concentrate optimization on regions needing refinement. Stream-R1 achieves consistent quality improvements across visual, motion, and text dimensions without architectural modifications or additional inference overhead.

Method

The authors propose Stream-R1, a dynamic spatiotemporal reward-guided distillation framework designed to enhance video generation quality. As illustrated in the framework diagram, the overall training pipeline integrates a student generator with a specialized reward modulation module. The process initiates with a text prompt fed into the student generator GθG_{\theta}Gθ, which produces a video rollout. This rollout is perturbed by adding noise and subsequently evaluated by two critic networks, ffakef_{\text{fake}}ffake and fRealf_{\text{Real}}fReal, alongside the Stream R1 module. The core innovation involves modulating the distillation signal through a balanced multi-dimensional penalty and weighting the loss using an Inter-Reliability weight WinterW_{\text{inter}}Winter and an intra-instance weight WintraW_{\text{intra}}Wintra.

To address the variance in supervision reliability across different generated rollouts, the framework employs Inter-Reliability Weighting. In standard Distribution Matching Distillation, gradients are averaged equally across all rollouts, yet the reliability of these gradients varies significantly depending on the rollout's proximity to the high-quality mode. The authors assign a per-sample loss multiplier that increases with the overall reward score. This ensures that rollouts where the supervision is reliable contribute more strongly to the gradient signal. As shown in the figure below, the method transitions from uniform optimization intensity to stronger optimization intensity in high-reliability regions, effectively filtering out noisy supervision. Additionally, region-wise heat maps highlight the specific areas targeted for refinement based on student rollout perplexity.

Within individual rollouts, the authors also address Intra-Perplexity variance, as different spatial regions and temporal frames contribute unequally to the potential for quality improvement. To localize optimization pressure, the method derives a per-element weight using adaptive gradient-saliency combination. The reward model is back-propagated to compute saliency maps for different quality dimensions, such as visual quality and temporal consistency. These maps are adaptively combined to form a unified guide that prioritizes regions with larger room for improvement.

Finally, the combined saliency volume undergoes spatiotemporal decomposition. This step disentangles the spatial structure from the temporal structure by separately normalizing the components before composing them into the final weight map WintraW_{\text{intra}}Wintra. This ensures that every frame retains meaningful internal contrast while allowing the temporal weights to modulate the contribution of entire frames. To ensure balanced improvement across multiple quality dimensions, a balance penalty is introduced. This penalty discourages the optimizer from focusing disproportionately on dimensions that yield easy gains. The final objective combines these weights with the base distillation loss to produce high-quality, temporally stable videos.

Experiment

Stream-R1 is evaluated against leading baselines across short and long video benchmarks using automated metrics, VLM scoring, and human preference studies. The results show that spatiotemporal reward localization enables the distilled model to surpass its multi-step diffusion teacher and existing reward-guided methods in total quality and semantic alignment. Furthermore, ablation studies and visualizations validate that targeting localized quality deficiencies effectively mitigates temporal drift, ensuring stable backgrounds and coherent motion in extended video generation.

The authors utilize a vision-language model to evaluate long video generation quality across visual fidelity, motion dynamics, and text alignment. Results indicate that the proposed Stream-R1 method outperforms existing baselines in visual quality and text alignment metrics. Although it trails slightly in dynamic scores compared to the Reward Forcing baseline, it demonstrates a superior balanced profile across all three dimensions. Stream-R1 achieves the highest visual quality score among all compared methods. The method secures the top ranking for text alignment, surpassing the previous best model. Performance remains competitive across all axes, highlighting a balanced optimization of video attributes.

The authors benchmark Stream-R1 against representative open-source video generation models, demonstrating that it achieves the highest total performance score among all compared methods. The model notably surpasses its multi-step diffusion teacher in total and semantic quality while operating at a significantly higher inference speed. Additionally, it achieves the top quality score among autoregressive and streaming models despite having comparable parameter counts to the diffusion baseline. Stream-R1 achieves the highest total score among all compared methods, surpassing the diffusion baseline Wan2.1. The model attains the best semantic score across all methods and the highest quality score within the autoregressive category. Inference speed is drastically improved over diffusion baselines, running at a much higher frame rate with comparable parameter counts.

The authors conducted a human preference study comparing Stream-R1 against Reward Forcing on 50 long videos across five evaluation dimensions. Results indicate that Stream-R1 is preferred in all categories, with the most significant advantages observed in dynamic reasonableness and visual quality. Stream-R1 achieves the highest win rates in Dynamic Reasonableness and Visual Quality & Aesthetics compared to other dimensions. The model demonstrates a consistent preference advantage in Overall Preference against the baseline method. Temporal Consistency shows a narrower margin of preference compared to the other evaluated dimensions.

The the the table presents an ablation study analyzing the contribution of spatial and temporal reward components to video generation quality. The full model, which combines these elements, demonstrates superior performance across short and long video benchmarks compared to the baseline and intermediate variants. Furthermore, hyperparameter sensitivity analysis indicates that the specific setting for the temporal weight floor is critical, as higher values lead to performance degradation. The full model incorporating temporal reward achieves the best results across all short video metrics and long video totals. Adding temporal reward decomposition results in the largest performance gain, significantly lowering drift in long video generation. Increasing the temporal weight floor parameter negatively impacts short video performance, confirming the optimal configuration of the full model.

The authors evaluate Stream-R1 through automated vision-language assessments, benchmarking, and human preference studies to validate its long video generation capabilities. Results indicate the method outperforms existing baselines in visual fidelity and text alignment while maintaining a balanced profile across motion dynamics and semantic quality. Human evaluations confirm a consistent preference for Stream-R1 over competing approaches, particularly regarding dynamic reasonableness and aesthetics, while ablation studies establish that combining spatial and temporal reward components is critical for minimizing drift.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp
Stream-R1: 스트리밍 영상 생성을 위한 신뢰도-퍼플렉시티 기반 보상 디스틸레이션 | 문서 | HyperAI초신경