VChain: Chain-of-Visual-Thought for Reasoning in Video Generation
Ziqi Huang Ning Yu Gordon Chen Haonan Qiu Paul Debevec Ziwei Liu

Abstract
Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time tuning of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead, and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.
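The abstract describes a two-stage pipeline: a multimodal model first proposes a sparse chain of keyframes, which then supervise inference-time tuning of the video generator at only those moments. Below is a minimal sketch of how such a pipeline could be organized, inferred solely from the abstract; every name here (Keyframe, propose_keyframes, tune_at_keyframes, fit_frame, sample) is a hypothetical placeholder, not the authors' actual API.

```python
from dataclasses import dataclass

@dataclass
class Keyframe:
    """One 'visual thought': a snapshot of a critical intermediate state."""
    time_fraction: float  # position within the clip, in [0, 1]
    image: object         # the keyframe image (e.g., a PIL image or tensor)

def propose_keyframes(prompt: str, num_keyframes: int = 4) -> list[Keyframe]:
    # Hypothetical stand-in for the reasoning step: a large multimodal
    # model (e.g., GPT-4o) is prompted to predict the sparse set of
    # critical visual states the scene should pass through.
    raise NotImplementedError("query a large multimodal model here")

def tune_at_keyframes(generator, keyframes: list[Keyframe], steps: int = 50):
    # Hypothetical sparse inference-time tuning: briefly optimize the
    # pre-trained video generator so its output at each keyframe's
    # timestamp matches that snapshot; all other frames receive no
    # supervision, keeping the tuning overhead minimal.
    for kf in keyframes:
        generator.fit_frame(at=kf.time_fraction, target=kf.image, steps=steps)
    return generator

def vchain_generate(generator, prompt: str):
    keyframes = propose_keyframes(prompt)                # chain of visual thought
    generator = tune_at_keyframes(generator, keyframes)  # sparse tuning
    return generator.sample(prompt)                      # final video synthesis
```

The sparsity is the design point this sketch tries to make concrete: supervision touches only the handful of reasoned keyframes, so the tuning loop stays cheap relative to dense per-frame guidance.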