Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models
Abstract
Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.
One-sentence Summary
Researchers from UCAS, CASIA, Beihang University, BUPT, and Alibaba Group introduce Omni-WorldBench, a novel benchmark evaluating interactive response in 4D world models. Unlike prior static or fidelity-focused tools, it employs Omni-Metrics to quantify how actions drive spatiotemporal state transitions, revealing critical gaps in current generative and reconstruction paradigms.
Key Contributions
- The paper introduces Omni-WorldBench, a comprehensive benchmark designed to evaluate the interactive response capabilities of world models in 4D settings by addressing the lack of systematic evaluation for temporal dynamics and spatial structure.
- This work presents Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types, alongside Omni-Metrics, an agent-based framework that quantifies model performance by measuring the causal impact of interaction actions on outcomes and state evolution trajectories.
- Extensive evaluations of 18 representative world models across multiple paradigms reveal critical limitations in current interactive response capabilities, providing actionable insights for future research in interactive 4D world modeling.
Introduction
Video-based world models are critical for tasks like planning and counterfactual simulation, yet current evaluation methods fail to capture their core capability: interactive response in 4D environments. Existing benchmarks either prioritize visual fidelity and text-video alignment or rely on static 3D metrics that ignore temporal dynamics and limit interactions to simple camera movements. To address this, the authors introduce Omni-WorldBench, a comprehensive framework featuring Omni-WorldSuite for diverse interaction prompts and Omni-Metrics, an agent-based system that quantifies how actions causally drive state evolution across space and time.
Dataset
- Dataset Composition and Sources: The authors introduce Omni-WorldSuite, a benchmark containing 1,068 evaluation prompts designed to test interactive world models. The dataset spans general daily-life scenarios and task-oriented domains including autonomous driving, embodied robotics, and gaming. It is constructed using two primary strategies: dataset-grounded generation and concept-driven synthesis.
- Key Details for Each Subset:
- Dataset-Grounded Subset: This portion extracts initial frames and camera trajectories from open-source datasets to ensure realism. It utilizes DriveLM for autonomous driving, InternData-A1 for embodied robotics, and Sekai for gaming and simulation.
- Concept-Driven Subset: This portion synthesizes text, initial frames, and camera trajectories using a generate-verify-refine pipeline. It relies on prototype concepts covering indoor/outdoor scenes and specific interaction types, with initial frames generated by FLUX.1-dev and refined through manual screening.
- Interaction Levels: Prompts are categorized into three hierarchical levels: Level 1 (effects confined to a single object), Level 2 (localized interactions between objects), and Level 3 (global environmental changes affecting multiple objects). Level 2 contains the largest number of prompts.
- Data Usage and Processing: The authors use this dataset strictly for evaluation rather than model training. Each prompt serves as a test case comprising an initial frame image, a textual description of interaction-driven evolution, and optional camera trajectories. The evaluation pipeline measures generated video quality, camera-object controllability, and interaction effect fidelity to produce a unified AgenticScore.
- Metadata Construction and Refinement: To facilitate metric computation, the authors annotate each prompt with auxiliary metadata including lists of affected and unaffected entities, expected motion directions, and temporal event sequences. A multi-stage image generation pipeline ensures high-fidelity initial frames with a minimum resolution of 1024×1024. All generated captions and images undergo manual verification to correct spatial relations, eliminate linguistic ambiguity, and ensure physical plausibility.
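The test-case structure described above can be sketched as a small data class. The field names below are illustrative assumptions for clarity, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch of one Omni-WorldSuite test case; field names and the
# example values are assumptions, not the benchmark's published format.
@dataclass
class EvalPrompt:
    prompt_id: str
    level: int                    # 1 = single object, 2 = localized, 3 = global
    scene_type: str               # e.g. "driving", "robotics", "gaming", "daily-life"
    initial_frame: str            # path to the >=1024x1024 initial frame
    description: str              # text describing interaction-driven evolution
    camera_trajectory: Optional[list] = None       # optional camera poses
    # Auxiliary metadata used for metric computation:
    affected_entities: list = field(default_factory=list)
    unaffected_entities: list = field(default_factory=list)
    expected_motions: dict = field(default_factory=dict)   # entity -> direction
    event_sequence: list = field(default_factory=list)     # ordered event strings

p = EvalPrompt(
    prompt_id="demo-001",
    level=2,
    scene_type="daily-life",
    initial_frame="frames/demo-001.png",
    description="A hand pushes the red cup; it slides and knocks over the spoon.",
    affected_entities=["red cup", "spoon"],
    unaffected_entities=["table", "plate"],
    expected_motions={"red cup": "right"},
    event_sequence=["cup pushed", "cup slides", "spoon falls"],
)
print(p.level, len(p.event_sequence))
```

The affected/unaffected entity lists and the ordered event sequence are exactly the annotations the InterCov and InterOrder metrics later consume.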
Method
The authors introduce Omni-Metrics, a comprehensive framework designed for an omni-directional assessment of world models. The framework defines three pivotal dimensions to establish a rigorous paradigm for benchmarking perceptual quality, environmental stability, and causal reasoning. As shown in the framework diagram, the system evaluates Generated Video Quality, Camera-Object Controllability, and Interaction Effect Fidelity, ultimately aggregating these scores through an adaptive weighting mechanism.

Before computing specific metrics, the framework performs structured information extraction from the generated video v conditioned on an evaluation prompt P. The authors employ GroundingDINO and SAM to extract temporally consistent segmentation mask sequences for each entity, denoted as {traj_k}_{k=1}^{N}. Additionally, RAFT is used to estimate the optical flow field F to capture regional motion intensity, while relative camera motion is approximated using optical flow variations between consecutive frames.
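Given a dense flow field F (such as RAFT produces), per-region motion intensity and a global camera-motion estimate can be derived with simple statistics. This is a minimal numpy sketch under my own assumptions (mean flow magnitude as motion energy, median flow as the camera estimate), not the paper's exact recipe:

```python
import numpy as np

def motion_energy(flow: np.ndarray, mask: np.ndarray) -> float:
    """Mean optical-flow magnitude inside a boolean region mask.
    flow has shape (H, W, 2); mask has shape (H, W)."""
    mag = np.linalg.norm(flow, axis=-1)          # per-pixel speed
    return float(mag[mask].mean()) if mask.any() else 0.0

def camera_motion_estimate(flow: np.ndarray) -> np.ndarray:
    """Approximate global camera motion as the median flow vector,
    which is robust to independently moving foreground objects."""
    return np.median(flow.reshape(-1, 2), axis=0)

# Toy example: uniform rightward background flow of 2 px/frame,
# plus one fast-moving object patch.
flow = np.zeros((64, 64, 2)); flow[..., 0] = 2.0
flow[10:20, 10:20, 0] = 8.0                      # moving object region
obj_mask = np.zeros((64, 64), bool); obj_mask[10:20, 10:20] = True
print(motion_energy(flow, obj_mask))             # high: the object moves fast
print(camera_motion_estimate(flow))              # dominated by the background
```

The same motion-energy quantity, evaluated outside the target entity masks, is the basis of the InterStab-N stability check described below.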
The Interaction Effect Fidelity dimension serves as a core contribution, quantitatively assessing long-term content consistency, causal logical ordering, and adherence to physical laws. To address these challenges, four comprehensive evaluation metrics are proposed. InterStab-L quantifies long-horizon temporal coherence by assessing visual content consistency across user-specified temporal revisit pairs R={(ta,tb)}. It integrates low-level structural fidelity and high-level semantic consistency via a composite similarity metric s(i,j):
s(i, j) = ½ (SSIM_gray(I_i, I_j) + cos(φ(I_i), φ(I_j))),
where φ(·) denotes a pre-trained vision encoder. To prevent trivial static sequences from inflating scores, a dynamics gating mechanism is incorporated. InterStab-N assesses the stability of non-target regions by measuring motion energy in areas outside the target entity masks. InterCov quantifies object-level causal faithfulness by verifying whether interaction-affected entities exhibit semantically consistent responses while unaffected entities remain stable, leveraging Vision-Language Models (VLMs) for semantic validation. Finally, InterOrder quantifies the alignment between the chronology of propagated events and the ground-truth sequence E = {e_i}_{i=1}^{K} by using a VLM to verify temporal precedence.
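A sketch of the InterStab-L composite similarity follows. The 1/2 averaging, the global (non-windowed) SSIM, and the motion-threshold form of the dynamics gate are simplifying assumptions on my part; `phi` stands in for the paper's pre-trained vision encoder and is supplied by the caller:

```python
import numpy as np

def ssim_gray(x: np.ndarray, y: np.ndarray, L: float = 1.0) -> float:
    """Global SSIM between two grayscale images with values in [0, L]
    (standard constants C1=(0.01L)^2, C2=(0.03L)^2, no sliding window)."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float((2 * mx * my + c1) * (2 * cov + c2)
                 / ((mx**2 + my**2 + c1) * (vx + vy + c2)))

def cosine(a, b) -> float:
    a, b = np.asarray(a, float).ravel(), np.asarray(b, float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def composite_similarity(img_i, img_j, phi) -> float:
    """s(i, j): average of low-level SSIM and high-level embedding cosine."""
    return 0.5 * (ssim_gray(img_i, img_j) + cosine(phi(img_i), phi(img_j)))

def gated_score(pairs, frames, phi, motion, tau=0.5) -> float:
    """Dynamics gating (assumed form): score only revisit pairs (t_a, t_b)
    whose intervening motion exceeds tau, so a frozen video cannot
    trivially achieve a perfect score."""
    kept = [composite_similarity(frames[a], frames[b], phi)
            for a, b in pairs if motion(a, b) > tau]
    return sum(kept) / len(kept) if kept else 0.0
```

For identical frames both terms equal 1, so s(i, j) = 1; the gate then decides whether that perfect score counts at all.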
The interaction examples illustrate the range of scenarios evaluated, from Level 1 static interactions to Level 3 complex dynamic interactions involving robotics and physics.

For the Generated Video Quality dimension, the authors leverage established metrics from prior benchmarks such as VBench and WorldScore, covering imaging quality, temporal flickering, motion smoothness, and content alignment. To effectively balance static and dynamic video attributes, the framework employs AgenticScore to perform adaptive weight allocation.
The AgenticScore mechanism treats each evaluation metric as an independent agent. Three interaction-centered agents compute scores for Interaction Effect Fidelity (A_I), Generated Video Quality (A_G), and Camera-Object Controllability (A_C). An aggregation agent then analyzes the relative importance of these dimensions using an MLLM conditioned on the evaluation prompt, mapping the resulting ranking to predefined weight coefficients w_1, w_2, and w_3. The final score is defined as:
AgenticScore = w_1·A_I + w_2·A_G + w_3·A_C.
This approach ensures that the evaluation adapts to diverse application scenarios, assigning different weights to different evaluation dimensions based on the semantic content of the prompt rather than simply averaging all metrics.
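The rank-to-weight aggregation can be sketched in a few lines. The weight table below is an illustrative assumption; in the paper, an MLLM produces the per-prompt ranking, which is then mapped to predefined coefficients:

```python
# Hypothetical weights assigned to importance ranks 1, 2, 3 (must sum to 1);
# the paper's actual coefficient values are not specified here.
RANK_WEIGHTS = [0.5, 0.3, 0.2]

def agentic_score(scores: dict, ranking: list) -> float:
    """scores: {'I': A_I, 'G': A_G, 'C': A_C}.
    ranking: the three dimension keys ordered by importance for this
    prompt (as an MLLM aggregation agent would produce)."""
    weights = {dim: RANK_WEIGHTS[r] for r, dim in enumerate(ranking)}
    return sum(weights[d] * scores[d] for d in scores)

# A physics-heavy prompt might rank interaction fidelity first:
s = agentic_score({"I": 0.6, "G": 0.9, "C": 0.7}, ["I", "C", "G"])
print(round(s, 3))   # 0.5*0.6 + 0.2*0.9 + 0.3*0.7 = 0.69
```

Because the ranking changes per prompt, the same three raw scores can yield different final values, which is exactly the adaptivity the authors contrast with naive averaging.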
Experiment
- Camera-Object Controllability experiments validate a new evaluation framework that assesses scene coherence, object consistency, and transition detection, demonstrating that reframing object control as a visual question answering task improves robustness over rule-based matching.
- Comprehensive benchmarking of 18 world models across Text-to-Video, Image-to-Video, and camera-controlled paradigms reveals that Image-to-Video models generally achieve the highest overall performance, while camera-aware methods excel in specific controllability metrics.
- Quantitative and qualitative analyses confirm that while most models have mastered basic temporal smoothness and flickering reduction, significant limitations persist in maintaining causal interaction consistency, handling complex physical dynamics, and achieving joint camera-object control.
- Visual comparisons highlight that advanced models like Wan2.2 and HunyuanWorld successfully preserve anatomical integrity and scene logic during complex actions, whereas others suffer from structural collapse or the generation of spurious elements, validating the proposed Omni-Metrics framework.