HyperAIHyperAI

Command Palette

Search for a command to run...

CoVEBench: 動画編集モデルは複雑な指示を処理できるか?

Jiangtao Wu Jiaming Wang Yiwen He Yuanxing Zhang Shihao Li Dunyuan Liu Xuedong Zhao Jialu Chen Zekun Moore Wang Jiaheng Liu

概要

近年のテキスト誘導型動画編集モデルは、スタイル転送やオブジェクト挿入といった基礎的なタスクにおいて優れた性能を示しているが、現実のユーザーからのリクエストは高度に複合的である。単一のプロンプトは、被写体や動作、カメラアングルの変更など、複数の連動した編集を要求する場合が多く、その一方で無関係な時空間的なコンテンツを厳密に保持することが求められる。既存のベンチマークは単一の編集と粗いグローバル指標に大きく制約されており、モデルがこのような複雑なワークフローをどのように処理するかを適切に診断できていない。この課題に対処するため、厳選された416本のソース動画、626の多点編集指示、9,990の細粒度チェックリスト項目で構成される複合動画編集ベンチマーク「CoVEBench」を提案する。多様な編集次元を網羅するCoVEBenchは、MLLMによる判定に基づく指示準拠性と動画忠実度、ならびに動画品質の自動評価指標を用いてモデルを評価する。広範な実験により、複合編集が依然として深刻な課題であることが明らかとなった。現在のモデルは複数の操作を同時に処理する際、編集の省略、保持制約の違反、またはアーティファクトの発生を頻繁に引き起こす。CoVEBenchは、動画編集を実際のユーザーワークフローに近づけるための、挑戦的で診断的なテストベッドを提供するものである。

One-sentence Summary

Addressing the limitations of prior benchmarks constrained by isolated edits and coarse metrics, the authors introduce CoVEBench, a diagnostic testbed comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items that evaluates compositional video editing through MLLM-judged instruction compliance and video fidelity alongside automated quality metrics.

Key Contributions

  • The work introduces CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items.
  • The benchmark establishes an evaluation framework that utilizes MLLM-judged instruction compliance and video fidelity alongside automated metrics to measure multi-operation adherence and physical realism.
  • Extensive experiments demonstrate that current video editing models struggle with complex compositional workflows, frequently omitting edits, violating preservation constraints, or introducing artifacts during simultaneous operations.

Introduction

Instruction-guided video editing models have advanced rapidly, yet real-world creation requires compositional workflows where users issue multiple simultaneous edits while strictly preserving background and subject integrity. This capability is vital for professional pipelines like film post-production and advertising localization, where failure to maintain consistency across coupled operations renders outputs unusable. Prior benchmarks lag behind these demands by focusing on trivial isolated edits and relying on coarse global metrics that cannot diagnose specific execution failures or verify physical realism. To bridge this gap, the authors present CoVEBench, a benchmark featuring 626 multi-point instructions across 416 videos that cover complex operations such as camera motion and structural adjustments. The authors leverage multimodal large language models to decompose these instructions into nearly 10,000 verifiable checklist items, enabling fine-grained diagnostics on instruction compliance, modification quality, and semantic preservation to rigorously assess model performance on realistic tasks.

Dataset

  • Dataset Composition and Sources

    • The authors introduce CoVEBench, a benchmark for compositional video editing comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items.
    • Source videos are aggregated from stock platforms including Pexels and Mixkit, alongside academic datasets such as Vript, UltraVideo, ViDiC, and LMArena.
  • Subset Details and Filtering

    • The authors apply strict filtering criteria to the video pool, retaining only clips with resolution at least 480p, durations between 3 and 21 seconds, and high visual quality.
    • Cross-pool near-duplicate removal and human review for editability reduce the collection to the final 416 videos.
    • Editing instructions are built upon a seven-dimension taxonomy covering Subject, Background, Camera, Style, Motion, Position, and Special Effects.
    • The authors define 83 category combinations and use MLLMs to generate instructions, ensuring an average of three atomic edits per prompt. Manual review removes inappropriate or repetitive outputs.
    • Checklists are synthesized by LLMs using editing instructions and source video descriptions. The authors retain approximately 67.2% of the initial outputs after rigorous manual filtering.
  • Data Usage and Processing

    • The authors explicitly state that CoVEBench serves exclusively as an evaluation benchmark and does not provide a paired large-scale training dataset.
    • The data is not used for model training or fine-tuning; it functions as a diagnostic testbed to assess compositional editing workflows.
    • Evaluation utilizes MLLM-judged instruction compliance and video fidelity, supplemented by automated metrics for video quality.
    • The checklist framework enables diagnostic assessment of execution accuracy, modification realism, and semantic preservation.
  • Metadata Construction and Structural Details

    • The authors construct metadata by generating detailed text descriptions of source videos, which are fed into LLMs alongside editing instructions to produce verification questions.
    • The checklist decomposes complex instructions into evaluation groups targeting execution accuracy, physical logic, and semantic preservation.
    • Questions include multiple-choice formats for object visibility, true/false checks for absence and physical consistency, and scoring rubrics to measure attribute preservation and structural distortion.
    • The structured output organizes evaluation groups by edit points, allowing separate assessment of atomic edit execution, interference between operations, and preservation of unrelated content.
    • No cropping strategy is described; the dataset relies on the duration and resolution constraints applied during curation.

Method

The framework for evaluating video editing models in CoVEBench is structured around a comprehensive, multi-stage pipeline that begins with the decomposition of complex editing instructions into atomic operations and culminates in a fine-grained diagnostic assessment. The process starts with a source video and a single-point edit instruction, which is then decomposed into a compositional instruction that explicitly details each required change, including additions, attribute modifications, spatial repositioning, and preservation constraints. This compositional instruction is fed into video editing models to generate an edited output, which is subsequently evaluated against the expected outcome. The evaluation is organized into three primary dimensions: Instruction Compliance, Video Quality, and Video Fidelity. Instruction Compliance assesses whether the specific editing instruction has been correctly applied, focusing on execution accuracy. Video Quality evaluates internal physical consistency, ensuring that the edited video adheres to physical laws such as gravity, momentum, and lighting, independent of the source video. Video Fidelity measures the preservation of unmodified elements, including semantic consistency, structural fidelity, motion fidelity, and static region consistency. Each of these dimensions is assessed using a set of fine-grained checklist questions that are designed to be verifiable through direct visual inspection.

The evaluation process leverages multiple large language models (LLMs) to generate and refine these checklists. The checklist generation process is guided by a taxonomy of seven core editing dimensions—subject manipulation, background modification, style transfer, motion control, spatial repositioning, camera editing, and visual effects—which are used to categorize editing instructions and ensure comprehensive coverage of diverse editing scenarios. Each editing instruction is independently classified by a labeling model to verify that it aligns with the intended category, preventing hallucinations and ensuring accurate auditing. The checklists are generated using a system prompt that instructs the LLM to act as a video QA specialist, producing a rigorous, multi-format evaluation checklist in strict JSON format. These checklists are designed to be highly specific and verifiable, with questions categorized into three dimensions: Execution Accuracy, Physical Logic, and Semantic Preservation, each using one of four question formats that mirror the prompts used in the evaluation.

The evaluation is conducted by specialized AI judges that operate under strict principles and constraints. For single-video true/false questions, the judge observes only the edited video and answers with a "Yes" or "No" based on factual visual evidence, applying a reasonable tolerance for minor deviations in physical laws. For dual-video true/false and scoring questions, the judge compares the source and edited videos, ensuring that the unmodified elements are preserved. Semantic Preservation questions are exclusively scored using a 1-10 scale, with a detailed rubric that categorizes failures from complete absence of the target to partial remnants. The framework also includes a physics forensics module that evaluates the edited video for real-world physics violations such as collisions, gravity, inertia, lighting, and material dynamics, with deductions applied for severe and minor violations. This comprehensive approach ensures a thorough and reliable assessment of video editing models across a wide range of editing dimensions and scenarios.

Experiment

The CoVEBench evaluation framework assesses video editing models across instruction compliance, visual quality, and content preservation using automated metrics and extensive MLLM-based checklists. The experiments validate model capabilities across varying temporal spans, instructional complexities, and computational constraints, revealing that closed-source models generally achieve stronger instruction adherence while open-source variants struggle with balancing edit execution and content preservation. Further analysis demonstrates that performance consistently degrades under complex multi-constraint scenarios, exposing critical limitations in physical realism, spatial control, and sequential editing stability. Ultimately, these findings highlight a fundamental bottleneck in compositional spatiotemporal reasoning that must be addressed to enable reliable and realistic video editing.

The authors evaluate video editing models using a comprehensive framework that assesses instruction compliance, video quality, and video fidelity. The results show that closed-source models generally perform better on instruction compliance, while open-source models exhibit varying strengths across different editing categories, with some excelling in style and background edits but struggling with complex, multi-constraint tasks. There is a clear trade-off between successful edit execution and preserving unaltered content, and models differ significantly in their inference efficiency and robustness under increasing temporal and editing complexity. Closed-source models achieve higher instruction compliance compared to open-source models, though the sample size is limited. Models show a trade-off between successful edit execution and preservation of unedited content, with some excelling in one area at the expense of the other. There is significant variation in model performance across different editing categories, with challenges remaining in camera control, motion editing, and subject replacement.

The authors evaluate video editing models across instruction compliance, video quality, and video fidelity, using a comprehensive set of metrics to assess execution accuracy, visual naturalness, and content preservation. Results show that closed-source models achieve higher instruction compliance, while open-source models exhibit varying strengths across different editing categories and face challenges in maintaining both edit accuracy and semantic consistency. The evaluation reveals a trade-off between successful editing and preserving unedited content, with models often sacrificing fidelity for execution. Closed-source models demonstrate higher instruction compliance compared to open-source models across all evaluation metrics. Open-source models vary significantly in performance across editing categories, with stronger results on style and background edits but weaker performance on complex tasks like motion and subject manipulation. There is a clear trade-off between edit execution and content preservation, as models that perform well on instruction following often exhibit lower semantic consistency in preserved regions.

The authors present a comprehensive evaluation framework for video editing models, focusing on instruction compliance, video quality, and video fidelity across multiple benchmarks. The analysis reveals that current models struggle with complex edits, particularly in maintaining content preservation while executing instructions, and exhibit varying strengths across different editing categories. Closed-source models show stronger instruction compliance compared to open-source models, though this is based on a limited sample. There is a significant trade-off between edit execution and content preservation, with some models excelling in one area while failing in the other. Current models perform better on global transformations like style transfer and background editing but struggle with precise spatiotemporal control and fine-grained object manipulation.

The authors analyze the performance of video editing models across various conditions, focusing on accuracy trends under different temporal and complexity factors. Results show that model performance generally declines as task complexity increases, with significant variations in how different models handle longer sequences, more edit points, and longer instructions. Kiwi demonstrates superior robustness in handling longer generated frames and more complex instructions compared to other models. Model accuracy tends to decrease as the number of generated frames, edit points, and instruction length increases. Kiwi exhibits better robustness in long-sequence generation and complex instruction handling compared to other models. Performance degradation is more pronounced for models when dealing with longer source videos and higher edit complexity.

The authors evaluate multiple video editing models using a comprehensive set of metrics across instruction compliance, video quality, and video fidelity. Results show that models vary significantly in their performance, with some achieving higher compliance but struggling with content preservation, and others showing trade-offs between edit execution and visual quality. Closed-source models demonstrate stronger instruction compliance compared to open-source models. Models face a trade-off between executing edits correctly and preserving unedited content. Performance declines as editing complexity increases, particularly with longer instructions and more edit points.

The authors evaluate video editing models using a comprehensive framework that assesses instruction compliance, visual quality, and content preservation across diverse editing tasks and complexity levels. These experiments validate core model capabilities and limitations, revealing that closed-source systems generally lead in instruction compliance while open-source models exhibit category-specific strengths but struggle with complex manipulations. The analysis consistently highlights a qualitative trade-off between executing edits accurately and preserving original content, alongside a clear performance decline as temporal length and instruction complexity increase. Despite these challenges, certain models demonstrate notably better robustness to extended sequences and intricate editing demands.


AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助
すぐに使える GPU
最適な料金体系

HyperAI Newsletters

最新情報を購読する
北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします
メール配信サービスは MailChimp によって提供されています