WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Explicit Actions and State Toward Generative Action Role-Playing Games (ARPG)
Zhen Li Zian Meng Shuwei Shi Wenshuo Peng Yuwei Wu Bo Zheng Chuanhao Li Kaipeng Zhang
Abstract
Dynamical systems theory and reinforcement learning define world evolution as latent state dynamics driven by actions, with visual observations providing only partial information about the state. Recent video-based world models attempt to learn these action-conditioned dynamics from data. However, existing datasets rarely satisfy this requirement: they typically lack diverse, semantically meaningful action spaces, and actions are tied directly to visual observations rather than mediated by latent states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, collected automatically from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, covering movement, attacks, and skill casting, along with synchronized per-frame annotations including character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models on two tasks: Action Following and State Alignment. Comprehensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining state consistency over long horizons, highlighting the need for state-aware video generation. Project page: https://shandaai.github.io/wildworld-project/.
One-sentence Summary
Researchers from Alaya Studio and multiple universities introduce WildWorld, a 108-million-frame dataset from Monster Hunter: Wilds featuring explicit state annotations to overcome the limitations of existing video models that struggle with long-horizon consistency and semantically rich action spaces.
Key Contributions
- The paper introduces WildWorld, a large-scale video dataset containing over 108M frames from a photorealistic AAA game, which provides explicit ground-truth annotations for player actions, character skeletons, world states, camera poses, and depth maps to support state-aware world modeling.
- This work presents WildBench, a benchmark designed to evaluate interactive world models using two specific metrics: Action Following to measure agreement with ground-truth sub-actions and State Alignment to quantify the accuracy of state transitions via skeletal keypoints.
- Extensive experiments conducted on WildBench compare baseline models against existing approaches, revealing current limitations in state transition modeling and offering insights for improving long-horizon consistency in generative ARPG environments.
Introduction
Understanding how the world evolves is central to building AI agents capable of long-horizon planning and reasoning, yet current approaches struggle because they rely on datasets with limited action semantics and lack explicit state information. Prior work often treats actions as direct visual changes or infers latent states implicitly from noisy observations, which fails to capture critical internal variables like ammunition counts that drive future outcomes. To address these gaps, the authors introduce WildWorld, a large-scale dataset of over 108M frames from a photorealistic game that provides explicit ground-truth annotations for actions, skeletons, and world states. They also present WildBench, a new benchmark with Action Following and State Alignment metrics, to rigorously evaluate how well models can disentangle state transitions from visual variations and maintain consistency over time.
Dataset
- Dataset Composition and Sources: The authors introduce WildWorld, a large-scale dataset automatically collected from the AAA action role-playing game Monster Hunter: Wilds. It contains over 108 million frames featuring more than 450 distinct actions, including movement, attacks, and skill casting. The data captures diverse interactions across 29 monster species, 4 player characters, and 4 weapon types within 5 distinct environmental stages.
- Key Details for Each Subset:
- Observations: Includes RGB frames, lossless depth maps, and synchronized camera poses (intrinsic and extrinsic parameters).
- States and Actions: Provides explicit ground truth such as character skeletons, absolute locations, rotations, velocities, animation IDs, and gameplay attributes like health and stamina.
- Annotations: Features 119 annotation columns per frame, including fine-grained action-level captions and sample-level summaries generated by large language models.
- WildBench: A derived benchmark subset of 200 manually curated samples designed to evaluate action following and state alignment, covering both cooperative and one-on-one combat scenarios.
- Data Usage and Processing:
- Collection Pipeline: The authors utilize an automated gameplay system that navigates menus and leverages rule-based companion AI to generate diverse combat and traversal trajectories without human intervention.
- Synchronization: A custom recording system based on OBS Studio and Reshade captures RGB and depth streams simultaneously, embedding timestamps to align data from multiple sources.
- Training Application: The dataset supports training action-conditioned world models by providing structured state dynamics rather than relying solely on visual observations.
- Filtering and Construction Strategies:
- Quality Filters: The authors apply multi-dimensional filters to remove low-quality samples, including discarding clips shorter than 81 frames, eliminating temporal gaps exceeding 50 ms, and removing sequences with extreme luminance levels.
- Occlusion Handling: Samples with camera occlusions (detected via spring-arm contraction) or severe character overlap (exceeding 30% projected area) are excluded to ensure visual clarity.
- Caption Generation: Hierarchical captions are created by segmenting samples into action sequences, sampling frames at 1 FPS, and using a vision-language model to generate detailed descriptions enriched with action and state ground truth.
- Visual Cleanup: The pipeline disables HUD shaders to produce clean, HUD-free frames that better reflect the game world for model training.
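The quality filters above can be sketched as a simple per-clip predicate. The thresholds (81 frames, 50 ms gaps, extreme luminance) come from the text; the per-frame data layout and the luminance bounds are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the multi-dimensional quality filters.
# Thresholds for length and gaps follow the paper; the data layout
# (list of per-frame dicts) and luminance bounds are assumptions.

MIN_FRAMES = 81              # discard clips shorter than 81 frames
MAX_GAP_MS = 50.0            # discard clips with temporal gaps > 50 ms
LUMA_RANGE = (16.0, 235.0)   # assumed bounds for "extreme luminance"

def passes_quality_filters(frames):
    """frames: list of dicts with 'timestamp_ms' and 'mean_luma' keys."""
    if len(frames) < MIN_FRAMES:
        return False
    ts = [f["timestamp_ms"] for f in frames]
    if any(b - a > MAX_GAP_MS for a, b in zip(ts, ts[1:])):
        return False
    mean_luma = sum(f["mean_luma"] for f in frames) / len(frames)
    return LUMA_RANGE[0] <= mean_luma <= LUMA_RANGE[1]
```

In practice such a predicate would run over every recorded clip before caption generation, so downstream annotation cost is only spent on clean samples.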
Method
The authors establish a robust data foundation through an automated gameplay system designed for high-fidelity recording (see the framework diagram). This system employs a timestamp-embedded recording mechanism to ensure frame-wise synchronization across multiple data streams. The Data Acquisition Platform captures diverse modalities, including camera pose, location, rotation, velocity, animation IDs, health status, and skeletal pose. After collection, the data undergoes a rigorous processing pipeline of timestamp alignment and filtering, removing abnormal frames and samples with issues such as short duration, shot-cuts, extreme luminance, or occlusion.
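The timestamp-based synchronization step can be illustrated as nearest-neighbor matching between streams. This is a minimal sketch under assumed conventions: each stream is a time-sorted list of `(timestamp_ms, payload)` pairs, and the tolerance value is illustrative rather than taken from the paper.

```python
# Sketch of aligning two recorded streams (e.g. RGB and depth) by embedded
# timestamps. Stream layout and the 8 ms tolerance are assumptions.
import bisect

def align_streams(rgb, depth, tol_ms=8.0):
    """Pair each RGB frame with the nearest depth frame within tol_ms."""
    depth_ts = [t for t, _ in depth]
    pairs = []
    for t, frame in rgb:
        i = bisect.bisect_left(depth_ts, t)
        # candidates: the depth frames just before and just after t
        best = min(
            (c for c in (i - 1, i) if 0 <= c < len(depth)),
            key=lambda c: abs(depth_ts[c] - t),
        )
        if abs(depth_ts[best] - t) <= tol_ms:
            pairs.append((frame, depth[best][1]))
    return pairs
```

The same matching extends to the other streams (camera pose, skeleton, game state) by aligning each against the RGB timeline.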

Building upon this dataset, the authors develop three distinct video generation approaches. For camera-conditioned generation, they fine-tune the Wan2.2-Fun-5B-Control-Camera model using ground-truth per-frame camera poses, creating the CamCtrl model. This contrasts with baseline approaches that rely on rule-based conversions of discrete actions. For skeleton-conditioned generation, the SkelCtrl model is introduced. This approach utilizes per-frame 3D skeleton keypoints projected into screen coordinates under the ground-truth camera pose to render colored-skeleton videos as control signals for the Wan2.2-Fun-5B-Control model.
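The skeleton-to-screen projection used for SkelCtrl control videos can be sketched with a standard pinhole camera model. The math below is conventional; the authors' exact axis conventions and units are assumptions.

```python
# Sketch: project 3D skeleton keypoints into screen coordinates under a
# ground-truth camera, as used to render SkelCtrl control videos.
# Standard pinhole model; conventions (axis order, units) are assumed.
import numpy as np

def project_keypoints(pts_world, extrinsic, intrinsic):
    """pts_world: (N, 3) points; extrinsic: (4, 4) world-to-camera;
    intrinsic: (3, 3) K matrix. Returns (N, 2) pixel coordinates."""
    pts_h = np.concatenate([pts_world, np.ones((len(pts_world), 1))], axis=1)
    cam = (extrinsic @ pts_h.T).T[:, :3]   # world -> camera frame
    uvw = (intrinsic @ cam.T).T            # camera frame -> image plane
    return uvw[:, :2] / uvw[:, 2:3]        # perspective divide
```

The projected keypoints would then be drawn as a colored-skeleton video and fed to the Wan2.2-Fun-5B-Control model as the control signal.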
The most comprehensive approach is the state-conditioned StateCtrl model. This architecture injects structured state information into the video generation process. States are categorized into discrete types, such as monster type and weapon category, and continuous types, such as coordinates and health. Discrete states are mapped via trainable embeddings, while continuous states are encoded using an MLP. The authors adopt a hierarchical modeling strategy that distinguishes between entity-level states and global-level states, such as recording time. A Transformer architecture models the relationships between entities to produce a unified state embedding. This embedding is aligned with video frames and injected into the intermediate layers of the DiT as a conditioning signal. To ensure the quality of these representations, a state decoder recovers state information from the embedding, and a state predictor forecasts the next-frame state. During training, decoder loss and predictor loss are applied to preserve state fidelity and enhance temporal consistency. For inference, the model supports autoregressive prediction of subsequent states based on the first frame's ground-truth state, denoted as StateCtrl-AR.
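The hierarchical state encoder can be sketched schematically: discrete states go through embedding tables, continuous states through a small MLP, and an attention step mixes per-entity tokens into one unified state embedding. Dimensions, layer counts, and the single-head attention pooling below are illustrative assumptions, not the authors' architecture.

```python
# Schematic numpy sketch of the StateCtrl-style state encoder. Widths and
# the one-layer attention pooling are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
D = 32  # embedding width (assumption)

# "Trainable" tables/weights, randomly initialized for the sketch.
monster_table = rng.normal(size=(29, D))   # 29 monster species
weapon_table = rng.normal(size=(4, D))     # 4 weapon types
W1, W2 = rng.normal(size=(4, D)), rng.normal(size=(D, D))  # continuous-state MLP

def encode_entity(monster_id, weapon_id, cont):
    """cont: 4 continuous values, e.g. (x, y, z, health)."""
    h = np.tanh(np.asarray(cont) @ W1) @ W2      # MLP over continuous state
    return monster_table[monster_id] + weapon_table[weapon_id] + h

def fuse_entities(tokens):
    """Attention-style pooling of per-entity tokens into one embedding."""
    T = np.stack(tokens)                          # (E, D)
    scores = T @ T.T / np.sqrt(D)                 # scaled dot-product
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return (attn @ T).mean(axis=0)                # (D,) unified state embedding
```

In the full model this unified embedding is aligned with video frames and injected into intermediate DiT layers, with the auxiliary state decoder and next-frame predictor supervising it during training.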
All models are trained at a resolution of 544×960 with 81 frames per sample at 16 FPS. Training runs for 250,000 iterations with the Adam optimizer at a learning rate of 1×10⁻⁵, using a per-device batch size of 1 for an effective batch size of 8. Inference uses the same resolution and frame rate with 50 sampling steps.
Experiment
- WildBench benchmark validates interactive world models by evaluating video quality, camera control, action following, and state alignment, distinguishing itself from existing benchmarks that focus primarily on perceptual quality.
- Reliability experiments confirm that the proposed Action Following metric aligns with human judgments at 85% agreement, while the State Alignment metric effectively measures consistency between generated and ground-truth state evolution.
- Comparative evaluations demonstrate that diverse approaches trained on the WildWorld dataset outperform baselines on interaction-related metrics, proving the dataset's utility for improving camera control and action responsiveness.
- Analysis reveals that standard video quality metrics like VBench are saturated and fail to capture nuanced differences in dynamic motion, whereas action and state metrics provide necessary fine-grained assessment.
- Qualitative findings indicate a trade-off where models using visual signals for control achieve better interaction fidelity but suffer from reduced aesthetic and image quality compared to those using learned embeddings.
- Experiments on autoregressive models show promise for future development but currently exhibit degraded action following due to error accumulation in iterative state prediction.