
GigaBrain-0.5M*: A Vision-Language-Action Model That Learns via World Model-Based Reinforcement Learning

Abstract

Vision-language-action (VLA) models that directly predict multi-step action sequences from current observations face fundamental limitations stemming from limited scene understanding and weak future-anticipation capabilities. In contrast, visual world models trained on massive internet video datasets exhibit strong spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. We therefore propose GigaBrain-0.5M*, a VLA model trained with world model-based learning. It builds on GigaBrain-0.5, which was pre-trained on more than 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first worldwide on the RoboChallenge benchmark. GigaBrain-0.5M* adds world model-based training via the RAMP methodology (Reinforcement leArning via world Model-conditioned Policy), enabling robust adaptation across tasks. Experimental results show that RAMP achieves substantial improvements over the RECAP baseline, with gains of roughly 30% on challenging tasks such as laundry folding, box packing, and espresso preparation. Crucially, GigaBrain-0.5M* demonstrates reliable long-horizon execution, consistently completing complex robotic manipulation tasks without failure, as validated by real-world deployment videos on our project page: https://gigabrain05m.github.io.

One-sentence Summary

The GigaBrain Team proposes GigaBrain-0.5M*, a world model-enhanced VLA trained via RAMP reinforcement learning, enabling robust cross-task adaptation and 30% gains over RECAP on complex robotic tasks like Laundry Folding and Espresso Preparation, validated through real-world deployment.

Key Contributions

  • GigaBrain-0.5M* addresses the limited scene understanding and weak future anticipation of standard VLA models by integrating a video world model trained on web-scale and robotic manipulation data, enabling more robust spatiotemporal reasoning for long-horizon tasks.
  • The model introduces RAMP (Reinforcement leArning via world Model-conditioned Policy), a novel training framework that conditions policy learning on world model predictions, allowing self-improvement through human-in-the-loop rollouts and enhancing cross-task adaptation without relying on imitation or policy gradients.
  • Evaluated on challenging real-world tasks including Laundry Folding and Espresso Preparation, GigaBrain-0.5M* achieves approximately 30% improvement over the RECAP baseline and demonstrates reliable long-horizon execution, validated by real-world deployment videos and top performance on the RoboChallenge benchmark.

Introduction

The authors leverage world model-based reinforcement learning to address the limited temporal reasoning in vision-language-action (VLA) models, which typically generate actions based only on immediate observations—hindering performance on long-horizon robotic tasks. Prior approaches rely on imitation learning or policy gradients, which suffer from compounding errors, sample inefficiency, or instability at scale. Their main contribution is GigaBrain-0.5M*, a VLA model built atop GigaBrain-0.5 that integrates RAMP—a reinforcement learning framework conditioned on world model predictions—to enable self-improvement through human-in-the-loop rollouts. This yields ~30% gains over RECAP baselines and enables reliable real-world execution of complex tasks like laundry folding and espresso preparation.

Method

The authors leverage a unified end-to-end Vision-Language-Action (VLA) architecture, GigaBrain-0.5, as the foundational policy model, which maps multimodal inputs—visual observations and natural language instructions—to sequences of robot actions. This model employs a mixture-of-transformers backbone, integrating a pre-trained PaliGemma-2 vision-language encoder for input representation and an action Diffusion Transformer (DiT) with flow matching for action chunk prediction. To enhance reasoning, GigaBrain-0.5 generates an Embodied Chain-of-Thought (Embodied CoT), comprising autoregressive subgoal language, discrete action tokens, and 2D manipulation trajectories $\mathbf{t}_{1:10}$. While language and discrete tokens are decoded via the VLM head, the 2D trajectory is regressed from learnable tokens through a lightweight GRU decoder. Depth and 2D trajectory information are treated as optional states, enabling adaptation to heterogeneous sensor modalities. All components are jointly optimized under a unified objective that balances CoT token prediction, diffusion-based action denoising, and trajectory regression:

$$\mathcal{L} = \mathbb{E}_{\mathcal{P}, \tau, \epsilon}\left[ -\sum_{j=1}^{n-1} M_{\mathrm{CoT},j} \log p_{\theta}\left( x_{j+1} \mid x_{1:j} \right) + \left\| \epsilon - a_{\mathrm{chunk}} - f_{\theta}\left( a_{\mathrm{chunk}}^{\tau,\epsilon} \right) \right\|^{2} + \lambda \left\| \mathrm{GRU}(\hat{\mathbf{t}}_{1:10}) - \mathbf{t}_{1:10} \right\|^{2} \right],$$

where $M_{\mathrm{CoT},j}$ masks CoT tokens, $\tau$ is the flow-matching timestep, $\epsilon$ is Gaussian noise, and $\lambda$ balances the trajectory loss. Knowledge Insulation ensures decoupled optimization between language and action prediction terms.
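The structure of this unified objective can be illustrated with a minimal numpy sketch; all shapes and values below are illustrative toy stand-ins, not the paper's actual dimensions or model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for what the VLM head, action DiT, and GRU decoder produce.
log_probs = rng.uniform(-3.0, -0.1, size=16)      # log p_theta(x_{j+1} | x_{1:j})
cot_mask = (rng.random(16) > 0.5).astype(float)   # M_{CoT,j}: 1 on CoT tokens, else 0

a_chunk = rng.normal(size=(8, 7))                 # ground-truth action chunk
eps = rng.normal(size=a_chunk.shape)              # Gaussian noise epsilon
f_out = rng.normal(size=a_chunk.shape)            # f_theta(a_chunk^{tau, eps})

traj_pred = rng.normal(size=(10, 2))              # regressed 2D trajectory GRU(t_hat_{1:10})
traj_gt = rng.normal(size=(10, 2))                # ground-truth trajectory t_{1:10}

def unified_loss(log_probs, cot_mask, eps, a_chunk, f_out, traj_pred, traj_gt, lam):
    # CoT term: masked negative log-likelihood over next-token predictions
    nll = -(cot_mask * log_probs).sum()
    # Flow-matching term: squared error against the noise-minus-action target
    flow = ((eps - a_chunk - f_out) ** 2).sum()
    # Trajectory term: squared error of the regressed 2D trajectory, weighted by lambda
    traj = lam * ((traj_pred - traj_gt) ** 2).sum()
    return nll + flow + traj

loss = unified_loss(log_probs, cot_mask, eps, a_chunk, f_out, traj_pred, traj_gt, lam=0.1)
```

The three terms are simply summed, with $\lambda$ the only explicit balancing weight in the stated objective.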

To further refine policy behavior, the authors introduce RAMP (Reinforcement leArning via world Model-conditioned Policy), a four-stage iterative training framework that integrates world model predictions to guide policy learning through experience and corrective feedback. As shown in the figure below, RAMP begins with World Model Pre-training, where a latent dynamics model $\mathcal{W}_{\phi}$ is trained to predict future visual states and value estimates from current observations and actions. The world model uses a Wan2.2 DiT backbone and is trained via flow matching on 4K hours of real robot manipulation data. Future visual states are encoded into spatiotemporal latents $\mathbf{z}_t$, while scalar signals (value $v_t$, proprioception $\mathbf{p}_t$) are spatially tiled and concatenated to form a unified latent state $\mathbf{s}_t = [\mathbf{z}_t; \Psi(v_t); \Psi(\mathbf{p}_t)]$. The training objective minimizes the squared error between the model's denoised output and the ground-truth latent state:

$$\mathcal{L}_{\mathrm{WM}} = \mathbb{E}_{\mathcal{D}, \tau, \epsilon}\left[ \left\| \mathcal{W}_{\phi}\left( \mathbf{s}_{\mathrm{future}}^{\tau,\epsilon} \right) - \left( \mathbf{s}_{\mathrm{future}} - \epsilon \right) \right\|^{2} \right].$$
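The tile-and-concatenate construction of the unified latent state $\mathbf{s}_t$ can be sketched as follows; the spatial grid size and channel counts are illustrative assumptions, and $\Psi$ is modeled here as plain spatial broadcasting:

```python
import numpy as np

def build_latent_state(z_t, v_t, p_t):
    """Tile scalar/vector signals over the spatial grid and concatenate them
    with the spatiotemporal latents along the channel axis:
    s_t = [z_t; Psi(v_t); Psi(p_t)]. Shapes are illustrative, not the paper's."""
    h, w, _ = z_t.shape
    v_tiled = np.full((h, w, 1), v_t)                       # Psi(v_t): scalar value, tiled
    p_tiled = np.tile(p_t.reshape(1, 1, -1), (h, w, 1))     # Psi(p_t): proprioception, tiled
    return np.concatenate([z_t, v_tiled, p_tiled], axis=-1)

z = np.zeros((4, 4, 8))                  # toy spatiotemporal latents z_t
s = build_latent_state(z, v_t=0.7, p_t=np.arange(6.0))
# s has 8 latent channels + 1 value channel + 6 proprioception channels
```

Tiling lets the DiT backbone treat the scalar signals as ordinary extra channels at every spatial location, so no separate conditioning pathway is needed.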

In Stage 2, the GigaBrain-0.5 policy is fine-tuned with world model conditioning. The policy receives future state tokens $\mathbf{z}_{\mathrm{future}}$ and value estimates $v_t$, which are projected and converted into binary advantage indicators $I = \mathbb{1}(A(\mathbf{s}_t, a_t) > \epsilon)$ via n-step TD estimation:

$$A(\mathbf{s}_t, a_t) = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} v_{t+n} - v_t.$$
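The n-step TD advantage and its binarization into the indicator $I$ are straightforward to compute; a minimal sketch (function names are illustrative, not the released API):

```python
def n_step_advantage(rewards, values, t, n, gamma):
    """A(s_t, a_t) = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n v_{t+n} - v_t."""
    n_step_return = sum(gamma ** k * rewards[t + k] for k in range(n))
    return n_step_return + gamma ** n * values[t + n] - values[t]

def advantage_indicator(adv, eps=0.0):
    """I = 1(A(s_t, a_t) > eps): binary advantage indicator fed to the policy."""
    return 1 if adv > eps else 0
```

The indicator collapses the continuous advantage into a single conditioning bit, which is what allows the optimistic $I = 1$ setting at inference time.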

The policy is trained to minimize the weighted negative log-likelihood of action generation conditioned on $(I, \mathbf{z})$, as defined in the RAMP objective:

$$\mathcal{L}(\theta) = \mathbb{E}_{D}\left[ -\log \pi_{\theta}(a \mid \mathbf{o}, \mathbf{z}, l) - \alpha \log \pi_{\theta}(a \mid I, \mathbf{o}, \mathbf{z}_t, l) \right].$$

To ensure robustness, stochastic attention masking randomly suppresses world model tokens with 20% probability during training, preventing over-reliance on synthetic signals.
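The stochastic masking step amounts to zeroing the attention columns for world model tokens on a random 20% of training examples; a minimal numpy sketch, where the mask layout and token indices are illustrative assumptions:

```python
import numpy as np

def mask_world_model_tokens(attn_mask, wm_token_idx, p_drop=0.2, rng=None):
    """With probability p_drop, suppress attention to the world-model token
    columns so the policy also learns to act without them (training only).
    attn_mask: 1 = attend, 0 = suppressed. Layout is an illustrative assumption."""
    rng = rng or np.random.default_rng()
    masked = attn_mask.copy()          # never mutate the caller's mask
    if rng.random() < p_drop:
        masked[:, wm_token_idx] = 0    # drop all world-model token columns at once
    return masked
```

Dropping the whole block of world model tokens (rather than individual ones) is what keeps training consistent with the "efficient" inference mode, which skips the world model entirely.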

Stage 3 deploys the policy for Human-in-the-Loop Rollout (HILR) data collection, where autonomous execution is interleaved with expert interventions. A smoothing mechanism removes temporal artifacts at intervention boundaries, preserving trajectory coherence. The resulting dataset combines native policy actions with expert corrections, reducing action distribution gaps compared to teleoperation.

In Stage 4, the policy is continually fine-tuned on the HILR dataset, while the world model is jointly updated to prevent advantage collapse. Stochastic masking is maintained to ensure training-inference consistency. The iterative rollout-annotation-training loop enables progressive policy improvement: as the policy becomes more capable, its autonomous rollouts generate higher-quality data for subsequent training cycles.

During inference, the authors enforce an optimistic control strategy by fixing $I = 1$. Two execution modes are supported: an efficient mode that bypasses the world model for maximum inference frequency, and a standard mode that leverages predicted future states $\mathbf{z}$ for long-horizon planning. This architectural decoupling, enabled by stochastic masking, ensures flexible deployment across varying computational constraints.
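The two inference modes can be sketched as a simple dispatch; all function names and signatures here are hypothetical illustrations, not the released API:

```python
def select_action(policy, world_model, obs, instruction, mode="efficient"):
    """Optimistic inference: the advantage indicator is fixed to I=1 in both modes.
    'efficient' skips the world model entirely (maximum control frequency);
    'standard' conditions on predicted future latents z for long-horizon planning."""
    I = 1  # optimistic control: always condition on the positive indicator
    if mode == "efficient":
        z = None                 # world-model tokens suppressed, as during masked training
    else:
        z = world_model(obs)     # predicted future state latents z
    return policy(obs, instruction, z=z, I=I)
```

Because stochastic masking exposed the policy to both conditions during training, the same checkpoint serves both modes without retraining.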

Experiment

  • GigaBrain-0.5 excels in complex, long-horizon robotic tasks like box packing and coffee preparation, outperforming prior models including π₀.5 and GigaBrain-0 across internal evaluations and the RoboChallenge benchmark.
  • RAMP, the world model-based reinforcement learning method, demonstrates superior sample efficiency and multi-task generalization compared to baselines like AWR and RECAP, achieving near-perfect success in challenging real-world tasks.
  • Ablation studies confirm the critical role of joint value and future state prediction in the world model, enhancing policy accuracy and task success while maintaining efficient inference.
  • World model conditioning significantly boosts performance in both single-task and multi-task settings, with the largest gains observed in multi-task training, indicating strong cross-task knowledge transfer.
  • Real-world deployments on PiPER arms and G1 humanoid robots validate robust, reliable execution across diverse manipulation tasks, including juice prep, laundry folding, and table bussing.

The authors compare value prediction methods and find that their world model-based approach, which jointly predicts future states and values, achieves the best balance of accuracy and speed. While a value-only world model variant is faster, it sacrifices prediction quality, and the VLM-based method is slower despite comparable accuracy. Results show that incorporating future state context significantly improves value estimation reliability.

