Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

Fangfu Liu Kai He Tianchang Shen Tianshi Cao Sanja Fidler Yueqi Duan Jun Gao Igor Gilitschenski Zian Wang Xuanchi Ren

Abstract

World models for interactive video generation have largely focused on single-agent settings, where future observations are predicted from a single control signal. However, many generated environments require multi-agent interaction, in which several players, robots, or embodied agents act simultaneously within a shared space. Extending world models to such settings calls for a principled multi-agent design: agents should remain independently controllable and permutation-symmetric, and the model should support efficient inference while staying consistent across time and across viewpoints. In this paper, we present our generative multi-agent world model for interactive simulation. The model introduces Simplex Rotary Agent Encoding, a parameter-free extension of 3D RoPE that represents agents as vertices of a regular simplex in rotary angle space. This gives each agent a distinct phase while keeping all agents equivalent under permutation, so agent identity can scale without learned per-slot identities or a fixed agent ordering. To avoid dense all-to-all attention across agents, we also propose Sparse Hub Attention, in which learnable hub tokens mediate token interaction across agents, reducing the cost of cross-agent attention from quadratic to linear in the number of agents. For real-time rollout, we distill a full-context diffusion teacher into a causal student that generates temporal blocks sequentially with KV caching, enabling action-responsive generation at 24 FPS. Experiments in multiplayer virtual environments show that our model improves video fidelity, action controllability, and inter-agent consistency over slot-based and dense-attention baselines, while generalizing from two to four players without additional training.

One-sentence Summary

Gamma-World is a generative multi-agent world model that employs Simplex Rotary Agent Encoding to establish permutation-symmetric identities and Sparse Hub Attention to reduce cross-agent computational complexity from quadratic to linear, enabling real-time, action-responsive video generation that enhances fidelity, controllability, and inter-agent consistency across two to four players without additional training.

Key Contributions

  • Introduces γ-World, a generative multi-agent world model featuring Simplex Rotary Agent Encoding. This parameter-free extension of 3D RoPE maps agents to the vertices of a regular simplex in rotary angle space to preserve permutation symmetry while assigning distinct phases without fixed orderings or learned per-slot identities.
  • Proposes Sparse Hub Attention, a cross-agent communication mechanism that routes interactions through learnable hub tokens. This architecture reduces the computational complexity of cross-agent attention from quadratic to linear relative to the number of agents.
  • Evaluates γ-World in multiplayer virtual environments to demonstrate improved video fidelity, action controllability, and inter-agent consistency over slot-based and dense-attention baselines. The framework distills a full-context diffusion teacher into a causal student with KV caching to enable real-time, action-responsive generation at 24 FPS and generalizes from two to four players without additional training.

Introduction

The authors address the growing need for controllable multi-agent world modeling, which is essential for realistic multiplayer game generation, interactive simulation, and embodied AI. Prior video world models largely remain single-agent systems, and existing multi-agent approaches struggle with scalability and structural symmetry. They rely on dense joint attention that incurs quadratic computational costs as agent count increases, and they use learned identity embeddings that break permutation symmetry and lock the model to fixed player rosters. To overcome these bottlenecks, the authors introduce γ-World, a scalable generative framework that enables interactive multi-agent simulation. They leverage Simplex Rotary Agent Encoding, a parameter-free method that positions agents at equal distances in rotary angle space to preserve permutation symmetry while maintaining distinct identities. They also implement Sparse Hub Attention, which routes cross-agent communication through learnable hub tokens to reduce computational complexity from quadratic to linear. By distilling a bidirectional teacher into a causal student with KV caching, the authors enable real-time 24-FPS autoregressive rollouts and demonstrate seamless scaling from two to four players without additional training.

Dataset

  • Dataset Composition and Sources: The authors build a paired video and action dataset spanning two domains: Minecraft-style game environments and robotic manipulation tasks. Each sample aligns video footage with explicit per-agent action traces.
  • Subset Details and Action Formats: The game subset stores 25-field action vectors per frame, containing 23 discrete player controls for inventory, hotbar selection, movement, item manipulation, and mouse interactions, plus 2 continuous fields for horizontal and vertical camera motion. The robot subset uses 10-field continuous vectors per frame, recording 3D end-effector position, 6D orientation, and gripper opening values. Both left and right robotic agents share this format, producing one temporally aligned sequence per robot.
  • Model Usage and Training Integration: The authors treat these action traces as explicit conditioning signals during model training. Actions are synchronized frame-by-frame with the video data and supplied as independent per-agent streams to guide generation.
  • Additional Processing Details: The provided excerpt does not specify dataset size, filtering rules, training splits, mixture ratios, cropping strategies, or metadata construction. The documentation focuses exclusively on action vector formatting and temporal alignment with video frames.
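The two per-frame action layouts described above can be sketched as plain Python containers. This is only an illustration of the stated field counts (23 discrete + 2 continuous for the game subset; 3 + 6 + 1 for the robot subset). The field order within each vector, and the names `GameAction`, `RobotAction`, `parse_game_action`, and `parse_robot_action`, are assumptions made for this example and are not taken from the source.

```python
from typing import List, NamedTuple, Sequence


class GameAction(NamedTuple):
    discrete: List[int]   # 23 discrete player controls
    camera: List[float]   # 2 continuous camera fields (horizontal, vertical)


class RobotAction(NamedTuple):
    position: List[float]     # 3D end-effector position
    orientation: List[float]  # 6D orientation
    gripper: float            # gripper opening


def parse_game_action(vec: Sequence[float]) -> GameAction:
    """Split a 25-field game action; discrete-first ordering is an assumption."""
    if len(vec) != 25:
        raise ValueError(f"expected 25 fields, got {len(vec)}")
    return GameAction([int(v) for v in vec[:23]], [float(v) for v in vec[23:]])


def parse_robot_action(vec: Sequence[float]) -> RobotAction:
    """Split a 10-field robot action; position/orientation/gripper order is an assumption."""
    if len(vec) != 10:
        raise ValueError(f"expected 10 fields, got {len(vec)}")
    return RobotAction(list(vec[:3]), list(vec[3:9]), float(vec[9]))


# One frame for one player, and one frame for one robot arm.
game = parse_game_action([0] * 23 + [0.5, -0.25])
robot = parse_robot_action([0.1, 0.2, 0.3, 1, 0, 0, 0, 1, 0, 0.8])
```

Each agent (player, or left/right robot arm) would carry its own sequence of such vectors, one per video frame.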

Method

The authors leverage a transformer-based latent video diffusion framework adapted for autoregressive generation, building on the DiT architecture. The model processes synchronized multi-agent inputs, where observations and actions from $P$ agents are tokenized and encoded into a shared latent space. The input representation is structured as $\mathbf{Z}_0 \in \mathbb{R}^{P \times T \times H \times W \times C_z}$, extending the single-agent latent with an explicit agent axis. During training, the model is conditioned on initial observations and per-agent action sequences to predict future latent observations for all agents jointly, ensuring consistency across time and agent perspectives. At inference, the first observations are encoded as context, while future latent tokens are initialized from noise and denoised block by block under the per-agent action sequences.

Action conditioning is implemented using a shared action encoder $f_a$ that maps each agent's action sequence $\mathbf{a}_{1:T}^p$ to a hidden action feature $\mathbf{u}_t^p \in \mathbb{R}^D$. This feature is projected to a layer-specific action bias $\boldsymbol{\beta}_{\mathcal{I}_t}^p$ and broadcast to all spatial tokens of the corresponding agent and frame, allowing the model to incorporate action information without breaking permutation symmetry. The model further enhances this by modifying the 3D rotary position embedding (RoPE) to account for agent identities, introducing a 4D rotary operator $\mathbf{R}_{\text{4D}}(t, p, h, w)$ that includes an agent axis $p$.
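A minimal sketch of the broadcast action bias may help make the symmetry argument concrete. It assumes a per-frame linear map as a stand-in for the shared encoder $f_a$ and a single linear layer for the layer-specific projection; the source does not describe either module's internals, so `linear`, `add_action_bias`, `w_enc`, and `w_layer` are hypothetical names used only for illustration.

```python
import random


def linear(weights, x):
    """Plain matrix-vector product: weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def add_action_bias(tokens, actions, w_enc, w_layer):
    """tokens[p][t][l] is a D-vector; actions[p][t] is an action vector.

    The same weights are used for every agent (shared encoder), and the
    resulting bias is added to every spatial token of that agent and frame.
    """
    out = []
    for agent_tokens, agent_actions in zip(tokens, actions):
        agent_out = []
        for frame_tokens, action in zip(agent_tokens, agent_actions):
            u = linear(w_enc, action)   # hidden action feature (stand-in for f_a)
            beta = linear(w_layer, u)   # layer-specific action bias
            agent_out.append(
                [[x + b for x, b in zip(tok, beta)] for tok in frame_tokens]
            )
        out.append(agent_out)
    return out


# Toy sizes: D hidden dims, A action dims, P agents, T frames, L spatial tokens.
rng = random.Random(0)
D, A, P, T, L = 4, 3, 3, 2, 2
w_enc = [[rng.uniform(-1, 1) for _ in range(A)] for _ in range(D)]
w_layer = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)]
tokens = [[[[rng.uniform(-1, 1) for _ in range(D)] for _ in range(L)]
           for _ in range(T)] for _ in range(P)]
actions = [[[rng.uniform(-1, 1) for _ in range(A)] for _ in range(T)]
           for _ in range(P)]
biased = add_action_bias(tokens, actions, w_enc, w_layer)
```

Because no weight depends on the agent index, reordering the agents in the input simply reorders the output in the same way, which is the sense in which this conditioning path does not break permutation symmetry.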

To address the challenge of agent identity representation without imposing a fixed ordering, the authors propose Simplex Rotary Agent Encoding. This method represents agents as vertices of a regular simplex in the rotary angle space, ensuring that all agents are equidistant and exchangeable. For a batch with $P \leq V$ active agents, an injective assignment $\pi$ maps agents to vertices from a fixed simplex pool of size $V$. The agent-band rotation angles are defined as $\theta_p = \alpha \, s_{\pi(p)}$, where $s_{\pi(p)}$ is the selected simplex vertex. This encoding is parameter-free, permutation-symmetric, and supports scaling to more agents by selecting additional unused vertices from the same pool.
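The geometry behind this encoding can be illustrated with a short sketch. It builds a regular simplex with $V$ vertices by centring and normalising the standard basis of $\mathbb{R}^V$, then scales the chosen vertices by $\alpha$ to obtain per-agent angle vectors. The particular embedding, the default identity assignment, and the value of `alpha` are assumptions for this example; the source only states that $\theta_p = \alpha \, s_{\pi(p)}$ with vertices drawn from a fixed pool.

```python
import math
from itertools import combinations


def simplex_vertices(V):
    """V unit vectors in R^V forming a regular simplex centred at the origin."""
    scale = math.sqrt((V - 1) / V)  # norm of (e_i - centroid)
    return [
        [((1.0 if i == j else 0.0) - 1.0 / V) / scale for j in range(V)]
        for i in range(V)
    ]


def agent_angles(num_agents, V, alpha, assignment=None):
    """theta_p = alpha * s_{pi(p)} for an injective assignment pi into the pool."""
    if assignment is None:
        assignment = list(range(num_agents))  # hypothetical default: identity
    if len(set(assignment)) != num_agents or num_agents > V:
        raise ValueError("assignment must be injective with num_agents <= V")
    pool = simplex_vertices(V)
    return [[alpha * c for c in pool[v]] for v in assignment]


# Four agents drawn from a pool of four vertices.
angles = agent_angles(num_agents=4, V=4, alpha=1.0)
distances = [math.dist(a, b) for a, b in combinations(angles, 2)]
# Every pair of agents is the same distance apart in angle space.
all_equal = all(math.isclose(d, distances[0]) for d in distances)

# Two agents from the same pool reuse the first two vertices unchanged.
two_agents = agent_angles(num_agents=2, V=4, alpha=1.0)
```

Under this construction there are no learned parameters, every pair of agents has the same angular separation, and adding agents only selects further vertices from the pool without moving the ones already in use.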

To reduce computational cost associated with dense cross-agent attention, the authors introduce Sparse Hub Attention (SHA). This mechanism uses a small set of learnable hub tokens to mediate information flow between agents. Agent tokens attend only to tokens from the same agent stream and to the hub tokens, while hub tokens attend to all agents and to other hub tokens. Direct attention between distinct agent streams is masked, enforcing a two-hop communication path: agent → hub → agent. The sequence is organized as $PTL$ agent tokens followed by $TK$ hub tokens, with $K$ hub tokens per latent frame. The hub-and-spoke topology is defined by the mask $\mathcal{M}_{\text{hub}}(i, j) = \mathbf{1}[\rho(i) = \rho(j) \vee \rho(i) = \text{hub} \vee \rho(j) = \text{hub}]$, where $\rho(i)$ denotes the identity of token $i$. For causal autoregressive generation, this topology is composed with a block-causal mask $\mathcal{M}(i, j) = \mathbf{1}[b(j) \leq b(i)] \cdot \mathcal{M}_{\text{hub}}(i, j)$, where $b(i)$ is the temporal block index of token $i$. This reduces the per-block attention cost from $O(P^2 n^2 L^2)$ to $O(P n L (n L + n K)) + O(n K (P n L + n K))$, which is linear in $P$ for fixed block size $n$, spatial length $L$, and number of hub tokens $K$.
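The mask definition above can be checked with a small sketch that builds $\mathcal{M}(i, j)$ explicitly and counts the allowed query-key pairs within one block. The token ordering inside the agent segment (agent-major, then frame, then spatial position), the `HUB` sentinel, and the function names are assumptions for this example; the source only specifies that the $PTL$ agent tokens precede the $TK$ hub tokens.

```python
HUB = -1  # role id for hub tokens; agents use 0..P-1 (hypothetical convention)


def token_layout(P, T, L, K):
    """Role and frame index per token: P*T*L agent tokens, then T*K hub tokens."""
    roles, frames = [], []
    for p in range(P):
        for t in range(T):
            roles.extend([p] * L)
            frames.extend([t] * L)
    for t in range(T):
        roles.extend([HUB] * K)
        frames.extend([t] * K)
    return roles, frames


def sparse_hub_mask(P, T, L, K, n):
    """M(i, j) = [b(j) <= b(i)] * [same stream, or i is hub, or j is hub]."""
    roles, frames = token_layout(P, T, L, K)
    blocks = [t // n for t in frames]  # n latent frames per temporal block
    size = len(roles)
    mask = [[False] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            hub_ok = roles[i] == roles[j] or roles[i] == HUB or roles[j] == HUB
            mask[i][j] = blocks[j] <= blocks[i] and hub_ok
    return mask, roles


def allowed_pairs_one_block(P, n, L, K):
    """Number of unmasked (query, key) pairs when the sequence is a single block."""
    mask, _ = sparse_hub_mask(P, T=n, L=L, K=K, n=n)
    return sum(sum(row) for row in mask)


pairs_2 = allowed_pairs_one_block(P=2, n=2, L=3, K=1)
pairs_4 = allowed_pairs_one_block(P=4, n=2, L=3, K=1)
```

For a single block the count equals $PnL(nL + nK) + nK(PnL + nK)$ exactly, so with $n=2$, $L=3$, $K=1$ it grows by a fixed 60 pairs per added agent, whereas dense attention over the agent tokens alone would grow as $(PnL)^2$.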

The training process is conducted in three stages to support real-time rollout. First, a bidirectional teacher model is trained for high-quality conditional denoising, exploiting full temporal and cross-agent visibility. Second, a causal student model is trained using the Diffusion Forcing formulation, combining block-causal attention with the Sparse Hub Attention mask. This causal student is trained as a full multi-step diffusion model, providing a stable starting point for distillation. Finally, a conditional Self-Forcing distillation process is applied, where the multi-step causal student is distilled into a few-step generator. The distillation uses distribution matching distillation (DMD) with rollout-aware training, encouraging the few-step student to preserve quality over its own generated histories. Because the distillation is conditional, the few-step generator remains conditioned on the initial observation and the action controls. At inference time, the distilled student generates one temporal block at a time, conditioned on the initial observations and the latest per-agent action block, streaming the rollout at 24 FPS. A KV cache is maintained for each agent stream, along with a shared cache for hub tokens, to preserve the Sparse Hub Attention topology during streaming.
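The cache layout described for streaming inference can be sketched as simple bookkeeping. The class below tracks which cached entries each query type would be allowed to read, using placeholder tuples where a real system would hold key and value tensors. `HubKVCache` and its methods are hypothetical names for this illustration and do not reflect the authors' implementation.

```python
class HubKVCache:
    """Per-agent caches plus one shared hub cache (placeholders, not tensors)."""

    def __init__(self, num_agents):
        self.agent_cache = {p: [] for p in range(num_agents)}
        self.hub_cache = []

    def append_block(self, block_idx):
        """Record the keys/values produced for one generated temporal block."""
        for p in self.agent_cache:
            self.agent_cache[p].append(("agent", p, block_idx))
        self.hub_cache.append(("hub", block_idx))

    def keys_for_agent(self, p):
        """An agent query reads its own stream and the hub, never other agents."""
        return self.agent_cache[p] + self.hub_cache

    def keys_for_hub(self):
        """A hub query reads every agent stream and the hub cache."""
        all_agents = [e for cache in self.agent_cache.values() for e in cache]
        return all_agents + self.hub_cache


# Stream two temporal blocks for two agents.
cache = HubKVCache(num_agents=2)
for block_idx in range(2):
    cache.append_block(block_idx)

agent0_view = cache.keys_for_agent(0)
hub_view = cache.keys_for_hub()
```

Keeping the agent caches separate means the agent → hub → agent routing holds at every step of the rollout: information from another agent reaches a query only through entries in the shared hub cache.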

Experiment

Evaluated across synchronized multi-agent Minecraft scenarios and real-world bimanual robotics tasks, the experiments validate that representing agents as distinct yet exchangeable entities coupled through a shared communication hub significantly improves cross-view consistency and generation fidelity over existing baselines. Qualitative demonstrations confirm the model maintains synchronized interactions, robust object grounding, and coordinated dynamics when scaling from two to four agents without architectural modifications. Efficiency analyses further verify that the sparse hub attention mechanism drastically reduces computational overhead as agent counts grow, enabling practical real-time inference. Ultimately, the framework successfully generalizes coupled multi-agent dynamics from virtual simulations to physical environments, establishing a scalable foundation for interactive world modeling.

The authors conduct an ablation study to evaluate the impact of different architectural choices in their multi-agent world model, focusing on input organization, agent identity encoding, and cross-agent interaction. The full model, which combines sequence concatenation, simplex agent encoding, and sparse hub attention, outperforms all other architectural variants in visual quality and consistency: it achieves the lowest FVD, FID, and LPIPS (lower is better) together with higher PSNR and SSIM (higher is better). Simplex rotary agent encoding improves over learned view embeddings by enabling distinct yet exchangeable agent identities. Sparse hub attention reduces computation and latency as the number of agents increases, enabling efficient multi-agent interaction.

The authors compare their method, γ-World, against two baselines in multi-agent world modeling, frame concatenation and Solaris, evaluating performance across several tasks including memory, grounding, movement, building, and consistency. Results show that γ-World outperforms both baselines in all categories, with improved visual quality and inter-agent consistency, particularly in scenarios requiring memory and cross-view coherence. The framework's design, which treats agents as distinct yet exchangeable entities and uses an efficient shared interaction pathway, contributes to this performance and to its scalability.

The authors present an ablation study comparing different training stages of their model: a bidirectional teacher, a causal student, and a distilled variant. The bidirectional teacher achieves the best performance across the evaluation metrics. The causal variant shows degraded performance due to its limited access to future context. The distilled model recovers most of the teacher's quality while maintaining a causal structure suitable for streaming inference.

The authors conduct an ablation study to evaluate the impact of the number of hub tokens in Sparse Hub Attention on generation quality. Results show that increasing the number of hub tokens leads to improvements in most metrics, indicating that a larger hub capacity enhances the model's ability to summarize and communicate multi-agent interactions, which in turn yields higher perceptual and pixel-level quality. However, the gains diminish at higher hub token counts, suggesting a trade-off between communication capacity and efficiency. Hub token count is therefore a critical design parameter for balancing communication capacity against model efficiency.

The authors evaluate their multi-agent world model through a series of ablation studies and comparative benchmarks to assess architectural design, training strategies, and hyperparameter configurations. Results demonstrate that combining sequence-based input organization, simplex agent encoding, and sparse hub attention yields the best visual quality and cross-agent consistency, outperforming the frame-concatenation and Solaris baselines across the evaluated tasks. Training stage comparisons further show that while the bidirectional teacher achieves peak performance, the distilled variant recovers most of that quality while enabling efficient streaming inference. Finally, the hub token ablation highlights a trade-off between communication capacity and computational efficiency.

