Context Forcing: Consistent Autoregressive Video Generation with Long Context

Shuo Chen Cong Wei Sun Sun Ping Nie Kai Zhou Ge Zhang Ming-Hsuan Yang Wenhu Chen

Abstract

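Recent approaches to real-time long video generation typically adopt streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts, but the teacher provides supervision using only short 5-second windows. This structural discrepancy creates a critical student-teacher mismatch: because the teacher has no access to long-term history, it cannot correctly guide the student on global temporal dependencies, which effectively caps the student's context length. To resolve this, we propose Context Forcing, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch and enable robust training of models with long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a Slow-Fast Memory architecture, significantly reducing visual redundancy. Extensive experiments demonstrate that our method enables effective context lengths exceeding 20 seconds, 2 to 10 times longer than state-of-the-art methods such as LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency over long durations and surpasses state-of-the-art baselines on various long video evaluation metrics.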

One-sentence Summary

Shuo Chen, Cong Wei, and colleagues from UC Merced and Tsinghua propose Context Forcing, a framework using long-context teachers to train students for 20s+ video generation, overcoming forgetting-drifting via Slow-Fast Memory, outperforming LongLive and Infinite-RoPE in long-term consistency.

Key Contributions

  • We identify and resolve a critical student-teacher mismatch in long video generation, where short-context teachers fail to supervise long-context students on global temporal dependencies, by introducing Context Forcing—a framework that trains students using long-context teachers aware of full generation history.
  • To enable computationally efficient training for extreme durations (e.g., 2 minutes), we design a Slow-Fast Memory architecture that compresses linearly growing context by reducing visual redundancy, allowing stable training and inference with 20+ seconds of effective context.
  • Evaluated on long video benchmarks, Context Forcing achieves 2–10× longer usable context than state-of-the-art methods like LongLive and Infinite-RoPE, significantly improving long-term consistency and outperforming baselines on key temporal coherence metrics.

Introduction

The authors leverage causal video diffusion models to tackle the challenge of generating long, temporally consistent videos—critical for applications like digital storytelling and professional editing—where prior methods suffer from either forgetting past context or drifting due to error accumulation. Existing approaches rely on short-context teachers to train long-context students, creating a mismatch that limits learnable temporal dependencies and forces a trade-off between memory and stability. Their main contribution is Context Forcing, a framework that trains a long-context student using a long-context teacher, eliminating this mismatch and enabling robust generation over 20+ seconds via a Slow-Fast Memory architecture that compresses redundant visual information while preserving global coherence.

Method

The authors leverage a two-stage curriculum within a causal autoregressive framework to train a long-context video diffusion model capable of maintaining temporal consistency over extended durations. The overall objective is to minimize the global KL divergence between the student’s induced distribution $p_{\theta}(X_{1:N})$ and the real data distribution $p_{\text{data}}(X_{1:N})$, where $N$ spans tens to hundreds of seconds. Direct optimization of this global objective is computationally infeasible, so the authors decompose it into local dynamics $\mathcal{L}_{\text{local}}$ and global continuation dynamics $\mathcal{L}_{\text{context}}$, enabling a tractable, staged training procedure.
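
One way to read this decomposition is through the standard chain rule for KL divergence, split at a context boundary $k$. The identity below is a general fact about KL divergence; mapping its first term to $\mathcal{L}_{\text{local}}$ and its second term to $\mathcal{L}_{\text{context}}$ is an interpretation of the summary above, not a formula quoted from the paper.

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(p_{\theta}(X_{1:N}) \,\|\, p_{\text{data}}(X_{1:N})\big)
&= \underbrace{D_{\mathrm{KL}}\big(p_{\theta}(X_{1:k}) \,\|\, p_{\text{data}}(X_{1:k})\big)}_{\text{short-window term}} \\
&\quad + \underbrace{\mathbb{E}_{X_{1:k} \sim p_{\theta}}\Big[
  D_{\mathrm{KL}}\big(p_{\theta}(X_{k+1:N} \mid X_{1:k}) \,\|\, p_{\text{data}}(X_{k+1:N} \mid X_{1:k})\big)
\Big]}_{\text{continuation term}}
\end{aligned}
```

Note that the expectation in the second term is taken over contexts drawn from the student, which is why the continuation has to be supervised on student-generated contexts.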

In Stage 1, the student is warmed up by minimizing $\mathcal{L}_{\text{local}}$, which aligns the distribution of short video windows $X_{1:k}$ (typically 1–5 seconds) with a high-quality teacher distribution $p_T(X_{1:k})$. This is achieved via Distribution Matching Distillation (DMD), where gradients are estimated using score matching between the student and teacher models on diffused versions of generated frames. This stage ensures the student generates high-fidelity short sequences, providing stable context for the subsequent stage.
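
The score-difference gradient behind DMD can be illustrated on a toy problem. The sketch below is not the authors' training code: it assumes a one-dimensional Gaussian "student" with a learnable mean and a Gaussian "teacher", so that both diffused scores are available in closed form. The names `gaussian_score` and `dmd_toy`, and all hyperparameters, are hypothetical.

```python
import random


def gaussian_score(x, mean, var):
    """Score (gradient of the log-density) of N(mean, var) evaluated at x."""
    return -(x - mean) / var


def dmd_toy(mu_teacher=2.0, mu_init=-1.0, sigma=1.0,
            steps=400, lr=0.05, batch=64, seed=0):
    """Move the student mean toward the teacher mean with a DMD-style gradient.

    Assumption: student samples are x = mu + sigma * z, so dx/dmu = 1, and
    noising is additive Gaussian, so a diffused N(m, sigma^2) is
    N(m, sigma^2 + noise_std^2) and its score is known exactly.
    """
    rng = random.Random(seed)
    mu = mu_init
    for _ in range(steps):
        grad = 0.0
        for _ in range(batch):
            x = mu + sigma * rng.gauss(0.0, 1.0)        # student sample
            noise_std = rng.uniform(0.1, 1.0)           # random noise level
            x_t = x + noise_std * rng.gauss(0.0, 1.0)   # diffused sample
            var_t = sigma ** 2 + noise_std ** 2
            s_fake = gaussian_score(x_t, mu, var_t)          # student score
            s_real = gaussian_score(x_t, mu_teacher, var_t)  # teacher score
            grad += (s_fake - s_real) * 1.0             # times dx/dmu = 1
        mu -= lr * grad / batch
    return mu


final_mu = dmd_toy()
print(f"student mean after training: {final_mu:.4f}")
```

In a real setup neither score is analytic: the teacher score comes from a pretrained diffusion model and the student score from an auxiliary network trained on student samples.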

Stage 2 targets $\mathcal{L}_{\text{context}}$, which enforces alignment between the student’s continuation $p_{\theta}(X_{k+1:N} \mid X_{1:k})$ and the true data continuation $p_{\text{data}}(X_{k+1:N} \mid X_{1:k})$. Since the true data continuation is inaccessible for arbitrary student-generated contexts, the authors introduce a pretrained Context Teacher $T$ that provides a reliable proxy $p_T(X_{k+1:N} \mid X_{1:k})$. This is justified under two assumptions: (1) the teacher remains accurate when conditioned on contexts near the real data manifold, and (2) Stage 1 ensures the student’s rollouts remain within this reliable region. The resulting Contextual DMD (CDMD) objective is optimized using a conditional score-based gradient estimator, where both student and teacher scores are computed on the same student-generated context, mitigating exposure bias.
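
A plausible schematic form of such a conditional estimator, written by analogy with the usual DMD gradient, is given below. This is a sketch under assumptions rather than the paper's exact expression: the weighting $w_t$, the noised sample $x_t$, and the score symbols $s_{\theta}$ and $s_T$ are notation introduced here for illustration.

```latex
\nabla_{\theta}\,\mathcal{L}_{\text{context}}
\;\approx\;
\mathbb{E}_{c,\,t,\,\epsilon}\Big[
  w_t \,\big( s_{\theta}(x_t, t \mid c) - s_{T}(x_t, t \mid c) \big)\,
  \frac{\partial x}{\partial \theta}
\Big],
\qquad
c = X_{1:k}, \quad x = X_{k+1:N} \sim p_{\theta}(\,\cdot \mid c)
```

Here $x_t$ is $x$ diffused to noise level $t$ with noise $\epsilon$, $s_{\theta}$ is the score of the student's diffused continuation distribution, and $s_T$ is the Context Teacher's score. Both are conditioned on the same student-generated context $c$.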

To handle the computational burden of long contexts, the authors design a Context Management System that organizes the KV cache into three functional components: an Attention Sink, Slow Memory, and Fast Memory. The Attention Sink retains initial tokens to stabilize attention, while Fast Memory acts as a rolling FIFO queue for immediate local context. Slow Memory stores high-entropy keyframes selected via a surprisal-based consolidation policy: a new token $x_t$ is promoted to Slow Memory if the similarity between its key vector $k_t$ and the preceding key $k_{t-1}$ falls below a threshold $\tau$, ensuring only salient temporal transitions are retained. This architecture enables efficient context retention without linear growth in memory or attention cost.
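
The three-part cache described above can be sketched as a small bookkeeping class. This is a minimal illustration, not the authors' implementation: it tracks frame indices rather than real key/value tensors, uses cosine similarity as the similarity measure, and evicts the oldest Slow Memory entry when the buffer is full. The class name `SlowFastCache`, the buffer sizes, and the value of `tau` are all assumptions.

```python
import math
from collections import deque


def cosine_similarity(a, b):
    """Cosine similarity between two key vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SlowFastCache:
    """Toy context manager with an attention sink, slow memory, and fast memory."""

    def __init__(self, sink_size=1, fast_size=3, slow_size=2, tau=0.9):
        self.sink_size = sink_size
        self.sink = []                        # first tokens, never evicted
        self.fast = deque(maxlen=fast_size)   # rolling FIFO of recent tokens
        self.slow = deque(maxlen=slow_size)   # fixed-size buffer of keyframes
        self.tau = tau
        self.prev_key = None

    def append(self, step, key):
        """Register the token generated at `step` with key vector `key`."""
        if len(self.sink) < self.sink_size:
            self.sink.append(step)
        else:
            # Promote to slow memory when the key differs enough from the
            # preceding key (similarity below tau), i.e. a salient transition.
            if (self.prev_key is not None
                    and cosine_similarity(key, self.prev_key) < self.tau):
                self.slow.append(step)
            self.fast.append(step)
        self.prev_key = key

    def context(self):
        """Steps currently visible to attention, deduplicated and in order."""
        return sorted(set(self.sink) | set(self.slow) | set(self.fast))


# Toy key sequence with scene changes at steps 2 and 5.
keys = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [1, 0], [1, 0], [1, 0]]
cache = SlowFastCache()
for step, key in enumerate(keys):
    cache.append(step, key)
print(cache.context())
```

The visible context never exceeds `sink_size + slow_size + fast_size` entries, regardless of how many steps have been generated.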

Refer to the framework diagram, which illustrates the evolution from short-context to long-context training with memory management. The diagram shows how the student progressively learns to generate longer sequences by leveraging the teacher’s supervision and the structured memory system. The memory components are dynamically updated: Fast Memory slides through recent frames, while Slow Memory compresses salient events into a fixed-size buffer. Bounded positional encoding is applied to all tokens, constraining their RoPE indices to a fixed range regardless of generation step, thereby stabilizing attention over long sequences.
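
One simple way to keep positional indices bounded is to index tokens by their slot in the assembled context rather than by their absolute generation step. The sketch below assumes that scheme; the summary does not specify the exact index mapping, and the function name `bounded_positions` is hypothetical.

```python
def bounded_positions(context_steps, max_context):
    """Map absolute frame steps to positional indices in [0, max_context).

    Assumption: indices are assigned by rank within the current context, so
    the largest index depends only on the context size, not on how far
    generation has progressed.
    """
    ordered = sorted(set(context_steps))
    if len(ordered) > max_context:
        raise ValueError("context holds more tokens than max_context")
    return {step: index for index, step in enumerate(ordered)}


# Early and late in generation, the index range stays the same.
early = bounded_positions([0, 2, 5, 6, 7], max_context=6)
late = bounded_positions([0, 1200, 1450, 2998, 2999], max_context=6)
print(early)
print(late)
```

Because the mapping preserves temporal order, relative positions inside the context remain meaningful while the absolute indices never leave the range seen during training.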

The training process further incorporates a Long Self-Rollout Curriculum, where the context horizon $k$ grows linearly with training steps to gradually expose the model to long-range dependencies. A Clean Context Policy ensures that context frames $X_{1:k}$ are fully denoised, while target frames $X_{k+1:N}$ are supervised via random timestep selection, preserving gradient coverage across all diffusion steps. To enhance the robustness of the Context Teacher, the authors employ Error-Recycling Fine-Tuning, injecting realistic accumulated errors into the teacher’s context during training to ensure it can correct for student drift during inference.
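
The curriculum and the clean-context policy can be sketched as two small helpers. This is an illustration under assumptions: the schedule endpoints, the number of noise levels, and the convention that level 0 means "fully denoised" are choices made here, and the function names `context_horizon` and `sample_noise_levels` are hypothetical.

```python
import random


def context_horizon(step, total_steps, k_min, k_max):
    """Context length k that grows linearly from k_min to k_max over training."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return k_min + round(progress * (k_max - k_min))


def sample_noise_levels(k, n, num_levels, rng):
    """Per-frame noise levels for a rollout of n frames with k context frames.

    Context frames get level 0 (clean); each target frame gets a level drawn
    uniformly from 1..num_levels, so every diffusion step receives gradient.
    """
    context = [0] * k
    targets = [rng.randint(1, num_levels) for _ in range(n - k)]
    return context + targets


rng = random.Random(0)
for step in (0, 500, 1000):
    k = context_horizon(step, total_steps=1000, k_min=5, k_max=25)
    levels = sample_noise_levels(k, n=k + 4, num_levels=4, rng=rng)
    print(step, k, levels[-4:])
```

Starting from a short horizon keeps early rollouts close to what Stage 1 already handles, and lengthening it gradually exposes the student to longer dependencies only once shorter ones are stable.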

Experiment

  • The robust context teacher successfully generates coherent video continuations from student-generated contexts, validating its ability to maintain long-term consistency across 10-second sequences.
  • The method achieves competitive performance on short video generation (5s) while significantly outperforming baselines in 60-second generation, particularly in preserving subject and background consistency over extended durations.
  • Ablation studies confirm that similarity-based slow memory sampling, Context DMD distillation, and bounded positional encoding are each critical for maintaining semantic and temporal coherence in long videos.
  • Error-Recycling Fine-Tuning enhances the context teacher’s robustness to accumulated generation errors, leading to cleaner rollouts and improved distillation quality.
  • Compared to LongLive and other long-video baselines, the proposed method avoids abrupt scene resets and cyclic motion artifacts, demonstrating superior qualitative stability despite comparable quantitative scores.

The authors evaluate ablation components of their video generation system, showing that their full method outperforms variants lacking key mechanisms like contextual distillation or bounded positional encoding. Results indicate that similarity-based slow memory sampling and bounded positional encoding significantly improve background and subject consistency over long sequences. The full model achieves the highest overall score, confirming the combined effectiveness of its architectural choices in maintaining temporal coherence.

The authors use a robust context teacher and student framework to generate long videos, achieving high consistency across 60-second sequences as measured by DINOv2, CLIP-F, and CLIP-T scores. Results show their method outperforms baselines like FramePack, LongLive, and Infinity-RoPE in maintaining subject and background stability over time, particularly beyond 20 seconds. Ablation studies confirm that key components—including similarity-based memory sampling, context distillation, and bounded positional encoding—are critical to sustaining long-term coherence.

The authors use a two-stage training approach with a robust context teacher to enable long video generation, achieving high consistency in both short and extended sequences. Results show their student model outperforms most baselines in background and subject consistency for 60-second videos, particularly excelling in maintaining stable semantics and structure over time. Ablation studies confirm that key components like similarity-based memory sampling and bounded positional encoding are critical for sustaining long-term coherence.

