
ShotVerse: Advancing Cinematographic Camera Control for Text-Driven Multi-Shot Video Generation

Songlin Yang Zhe Wang Xuyi Yang Songchun Zhang Xianghao Kong Taiyi Wu Xiaotong Zhao Ran Zhang Alan Zhao Anyi Rao

Abstract

Text-driven video generation has democratized filmmaking, but camera control in cinematic multi-shot scenarios remains a significant challenge. Implicit text prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and frequently leads to execution failures in current models. To overcome this bottleneck, this work proposes a data-centric paradigm shift. Specifically, the authors hypothesize that aligned (caption, trajectory, video) triplets form an inherent joint distribution linking automated plotting with precise execution. Building on this insight, they propose ShotVerse, a "Plan-then-Control" framework that decouples generation into two cooperating agents: a VLM (Vision-Language Model) based Planner that leverages spatial priors to derive cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. At the core of the approach is the construction of a data foundation: the authors design an automated multi-shot camera calibration pipeline that aligns fragmented single-shot trajectories into a unified global coordinate system. This enables the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with three evaluation protocols, which underpins the framework. Extensive experiments show that ShotVerse effectively bridges the gap between unreliable text control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos with high camera accuracy and cross-shot consistency.

One-sentence Summary

Researchers from HKUST and Tencent Video AI Center propose ShotVerse, a Plan-then-Control framework that uses a VLM-based planner to automate cinematic trajectory generation. By introducing a novel calibration pipeline and the ShotVerse-Bench dataset, it enables precise, consistent multi-shot video creation without manual trajectory plotting.

Key Contributions

  • Current text-driven video generation struggles with precise camera control in multi-shot scenarios due to the lack of precision in implicit prompts and the high manual overhead of explicit trajectory conditioning.
  • ShotVerse introduces a "Plan-then-Control" framework that decouples generation into a VLM-based Planner for automated cinematic trajectory plotting and a Controller for precise multi-shot video rendering.
  • The authors establish a data foundation by designing an automated calibration pipeline to create ShotVerse-Bench, a high-fidelity dataset that enables the framework to achieve superior camera accuracy and cross-shot consistency.

Introduction

Text-driven video generation has empowered users to create cinematic content, yet controlling camera movements across multiple shots remains a major hurdle. Prior approaches struggle with the imprecision of implicit text prompts or the prohibitive manual effort and execution failures associated with explicit trajectory conditioning. To address these challenges, the authors introduce ShotVerse, a "Plan-then-Control" framework that decouples cinematic planning from video synthesis. They leverage a Vision-Language Model as a Planner to automatically generate unified 3D camera trajectories from text descriptions and a Controller to render high-fidelity multi-shot videos based on these precise paths. Central to this work is the creation of ShotVerse-Bench, a new dataset featuring disjoint shots aligned into a global coordinate system via an automated calibration pipeline, which enables the model to learn the spatial logic required for professional cinematography.

Dataset

  • Dataset Composition and Sources: The authors introduce ShotVerse-Bench, a dataset designed to align semantic descriptions with globally unified camera trajectories. It comprises 20,500 clips sourced from high-production cinema to ensure professional cinematography standards and a balanced taxonomy of camera controls.

  • Key Details for Each Subset:

    • Multi-Shot Sequences: The training subset consists of 1,000 scenes, while the test subset contains 100 scenes with no overlap.
    • Structure: These scenes are formed by grouping 2,750 representative single-shot clips into sequences of 249 frames.
    • Shot Distribution: Each multi-shot scene contains 2, 3, or 4 shots, distributed according to a 6:3:1 ratio.
  • Data Usage and Processing:

    • Preprocessing: The authors remove embedded subtitles and standardize all videos to a resolution of 843 × 480 pixels.
    • Training Split: The 1,000 scenes are allocated for model training, and the remaining 100 scenes serve as the test set.
  • Calibration and Alignment Strategy: To unify disjoint single-shot trajectories into a global coordinate system, the authors employ a four-step pipeline:

    • Dynamic Foreground Removal: They use SAM to mask foreground objects, retaining only static background regions for robust pose estimation.
    • Local Reconstruction: Each shot is independently reconstructed using PI3 on the static background to create locally consistent trajectories.
    • Global Reconstruction: Keyframes across disjoint shots are sampled and jointly reconstructed via PI3 to generate a unified static scene and global poses.
    • Anchor-Based Alignment: The authors identify anchor frames present in both local and global reconstructions to estimate a similarity transformation. This aligns local frames to the global system and resolves scale ambiguity by comparing relative displacements of start and end keyframes.
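The anchor-based alignment step can be sketched with the standard Umeyama similarity estimate. The following is an illustrative reconstruction, not the authors' exact implementation: the function name and the use of anchor-frame camera centers as point correspondences are assumptions.

```python
import numpy as np

def umeyama_similarity(src, dst):
    """Estimate a similarity transform (scale s, rotation R, translation t)
    mapping src -> dst via Umeyama's method.
    src, dst: (N, 3) arrays of corresponding anchor-frame camera centers
    in the local and global reconstructions, respectively."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                 # cross-covariance of centered points
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                           # guard against reflections
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)         # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_s       # resolves the scale ambiguity
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Once (s, R, t) is estimated from the anchors, every local camera pose can be mapped into the global system, which is what lets disjoint shots share one coordinate frame.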

Method

The authors propose ShotVerse, a framework designed to decouple the multi-shot camera control task into distinct planning and controlling phases. This "Plan-then-Control" architecture bridges the gap between unreliable textual camera control and labor-intensive manual plotting by modeling the joint distribution of aligned triplets consisting of captions, trajectories, and videos. The overall workflow is illustrated in the figure below, which outlines the three core components: Dataset Curation, the Planner for trajectory plotting, and the Controller for trajectory injection.

Dataset Curation

The foundation of the framework relies on a curated dataset where multi-shot trajectories are aligned into a unified global coordinate system via camera calibration. This process pairs the visual data with hierarchical captions, including a global description and specific per-shot narratives, ensuring the model learns the relationship between semantic content and geometric camera movement.

Planner: Trajectory Plotting

The Planner module is responsible for synthesizing cinematic, globally unified trajectories from hierarchical textual descriptions. It operates by modeling the conditional probability P(Trajectory | Caption). The process begins with Hierarchical Prompt Construction, where the input sequence interleaves a global prompt with per-shot prompts and learnable trajectory query tokens. These tokens act as placeholders for the Vision-Language Model (VLM) to fill with shot-specific camera plans.
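The interleaving of global prompt, per-shot prompts, and query slots can be sketched as follows. This is a minimal illustration; the placeholder id, the number of query tokens per shot, and the function name are assumptions, not details from the paper.

```python
QUERY_ID = -1  # illustrative placeholder id for a learnable trajectory query token

def build_input(global_ids, shot_ids_list, n_queries=4):
    """Interleave global prompt tokens, per-shot prompt tokens, and
    trajectory-query slots into one input sequence.
    Returns the sequence and, per shot, the positions of its query tokens,
    whose hidden states the VLM later fills with shot-specific camera plans."""
    seq = list(global_ids)
    query_pos = []
    for shot_ids in shot_ids_list:
        seq.extend(shot_ids)                                   # per-shot prompt
        query_pos.append(list(range(len(seq), len(seq) + n_queries)))
        seq.extend([QUERY_ID] * n_queries)                     # query slots
    return seq, query_pos
```

Because the query slots follow their shot's prompt and all earlier shots, self-attention lets each slot aggregate both the global context and the preceding camera plans, matching the context-aware encoding described next.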

The VLM backbone, fine-tuned with LoRA, performs context-aware encoding. Through self-attention mechanisms, the hidden states of the trajectory query tokens aggregate information from the global context and previous shots. These encoded representations, referred to as camera codes, are then passed to a Trajectory Decoder. This decoder expands the fixed number of camera codes into a variable-length sequence of trajectory tokens. Finally, a Pose De-Tokenizer maps these generated tokens back to continuous values, recovering the explicit camera poses in the unified global coordinate system. The Planner is trained end-to-end to maximize the log-likelihood of the ground-truth trajectory tokens.
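The expansion from a fixed number of camera codes to a variable-length trajectory could be realized with learned per-frame queries cross-attending to the codes. The sketch below is a hedged reconstruction: the module layout, dimensions, and the 12-value flattened 3×4 extrinsic parameterization are illustrative assumptions, not the authors' specification.

```python
import torch
import torch.nn as nn

class TrajectoryDecoder(nn.Module):
    """Sketch: expand fixed camera codes into per-frame trajectory tokens
    via cross-attention, then map tokens to continuous poses with a
    de-tokenizer head (here a linear layer regressing a flattened 3x4 pose)."""
    def __init__(self, d_model=256, max_frames=249, pose_dim=12):
        super().__init__()
        self.frame_queries = nn.Parameter(torch.randn(max_frames, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                                batch_first=True)
        self.de_tokenizer = nn.Linear(d_model, pose_dim)

    def forward(self, camera_codes, n_frames):
        # camera_codes: (B, K, d_model) hidden states of the VLM's query tokens
        B = camera_codes.size(0)
        q = self.frame_queries[:n_frames].unsqueeze(0).expand(B, -1, -1)
        h, _ = self.cross_attn(q, camera_codes, camera_codes)
        return self.de_tokenizer(h)  # (B, n_frames, pose_dim) continuous poses
```

The variable output length is obtained simply by slicing the learned frame queries, so the same decoder serves shots of different durations.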

Controller: Trajectory Injection

The Controller module renders the plotted trajectories into high-fidelity multi-shot videos by modeling P(Video | Caption, Trajectory). It utilizes a holistic Diffusion Transformer (DiT) backbone. To ensure precise adherence to the camera paths, the authors implement a Camera Encoder that projects the extrinsic matrix of each frame into a feature vector. This camera embedding is directly added to the intermediate visual features before the self-attention layer, enforcing geometric adherence without disrupting pre-trained priors.
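The additive injection of camera embeddings can be sketched as below. The MLP depth, hidden size, and per-frame broadcast over spatial tokens are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CameraEncoder(nn.Module):
    """Sketch: embed each frame's flattened 3x4 extrinsic matrix and add it
    to the visual features entering a self-attention block."""
    def __init__(self, d_model=128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(12, d_model),
                                  nn.SiLU(),
                                  nn.Linear(d_model, d_model))

    def forward(self, visual_feats, extrinsics):
        # visual_feats: (B, F, N, d) spatial tokens per frame
        # extrinsics:   (B, F, 3, 4) per-frame camera extrinsic matrices
        cam = self.proj(extrinsics.flatten(2))   # (B, F, d)
        return visual_feats + cam.unsqueeze(2)   # broadcast over the N tokens
```

Because the condition enters as a residual addition rather than extra tokens or modified attention, the pre-trained DiT weights see inputs of the same shape and scale, which is what "without disrupting pre-trained priors" amounts to.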

To handle the hierarchical temporal structure of multi-shot videos, the architecture employs a 4D Rotary Positional Embedding strategy. Unlike standard 3D embeddings, this approach partitions the attention head dimension into four subspaces corresponding to shot index, frame index, height, and width. This explicitly informs the model of shot boundaries, enforcing intra-shot consistency while maintaining fine-grained temporal dynamics. The Controller is optimized using a Flow Matching objective, predicting the velocity field between clean video latents and Gaussian noise samples conditioned on both text and camera embeddings.

Experiment

  • A three-track evaluation protocol was established to assess text-to-trajectory planning, trajectory-to-video execution fidelity, and end-to-end multi-shot generation quality, addressing the lack of systematic benchmarks for cinematic video creation.
  • The proposed Planner outperforms existing trajectory generation methods by leveraging vision-language priors to better align complex narrative prompts with camera sequences.
  • The Controller demonstrates superior execution fidelity and cross-shot consistency compared to single-shot baselines, effectively maintaining subject orientation and geometric alignment across cuts.
  • End-to-end results show that the integrated system achieves higher temporal coherence and aesthetic quality than both open-source and commercial models, which often fail to execute complex camera movements without explicit geometric guidance.
  • Ablation studies confirm that the VLM encoder, dedicated trajectory decoder, 4D positional encoding, and unified camera calibration are critical components for achieving stable shot transitions and cinematic pacing.
  • Training on real cinematic data rather than synthetic triplets is essential for preserving visual quality and temporal stability in the final output.
