ABot-M0: A VLA Foundation Model for Robot Manipulation Using Action Manifold Learning

Overview

Building general-purpose embodied agents that work across diverse hardware is a central challenge in robotics, often framed as the "one brain, many forms" paradigm. Progress in this field has been hindered by data fragmentation, inconsistent representations, and misaligned training objectives. This work proposes ABot-M0, a framework that constructs a systematic data-collection pipeline while jointly optimizing the model architecture and training strategy, enabling heterogeneous raw data to be transformed end-to-end into consistent, efficient representations. By cleaning, standardizing, and balancing samples from six public datasets, the authors build UniACT-dataset, a large-scale dataset containing over 6 million trajectories and more than 9,500 hours of data that covers diverse robot embodiments and task scenarios. Unified pre-training improves knowledge transfer and generalization across platforms and tasks, supporting general-purpose embodied intelligence. To improve the efficiency and stability of action prediction, the authors propose the Action Manifold Hypothesis: valid robot actions do not occupy the full high-dimensional space but lie on a low-dimensional, smooth manifold governed by physical laws and task constraints. Building on this hypothesis, they introduce Action Manifold Learning (AML), which uses a Diffusion Transformer (DiT) backbone to directly predict clean, continuous action sequences. This shifts learning from conventional denoising to learning a projection onto the feasible manifold, improving decoding speed and policy stability. ABot-M0 supports modular sensory processing through a dual-stream mechanism that integrates the semantic information of a vision-language model (VLM) with geometric priors and multi-view inputs from plug-and-play 3D modules (e.g., VGGT, Qwen-Image-Edit), strengthening spatial awareness without modifying the backbone and mitigating the 3D-reasoning limitations of conventional VLMs. Experimental results confirm that each module operates independently and that their effects are additive. All code and pipelines will be released to support reproducibility and future research.

One-sentence Summary

The AMAP CV Lab team introduces ABot-M0, a unified framework leveraging Action Manifold Learning and UniACT-dataset to enable efficient, general-purpose embodied agents across diverse robots; it shifts action prediction to low-dimensional manifolds, integrates modular 3D perception, and boosts stability without altering backbones.

Key Contributions

  • ABot-M0 introduces a unified framework that standardizes heterogeneous robotic datasets into UniACT-dataset (6M+ trajectories, 9.5K hours) and jointly optimizes architecture and training to enable cross-embodiment generalization without custom hardware or proprietary data.
  • It proposes the Action Manifold Hypothesis and Action Manifold Learning (AML), using a DiT backbone to directly predict clean, continuous actions on a low-dimensional manifold, improving decoding speed and policy stability over traditional denoising methods.
  • The framework integrates modular 3D perception via plug-and-play modules like VGGT and Qwen-Image-Edit into a dual-stream VLM architecture, enhancing spatial reasoning while maintaining backbone integrity, and achieves state-of-the-art results on the LIBERO, RoboCasa, and RoboTwin benchmarks.

Introduction

The authors leverage a unified framework to tackle the challenge of building general-purpose robotic agents that work across diverse hardware, a key goal in embodied AI. Prior efforts struggle with fragmented datasets, inconsistent representations, and misaligned training, which limit cross-platform generalization. Their main contribution is ABot-M0, which combines a large curated dataset (UniACT, 6M+ trajectories) with a novel Action Manifold Learning module that predicts actions as smooth, low-dimensional sequences—improving efficiency and stability over traditional denoising methods. They also introduce a dual-stream perception system that integrates VLM semantics with plug-and-play 3D modules, enhancing spatial reasoning without modifying the backbone. Experiments show state-of-the-art performance across multiple benchmarks, proving that high-quality embodied intelligence can emerge from systematically engineered public resources.

Dataset

The authors use a curated, large-scale dataset called UniACT-dataset to train a general-purpose Vision-Language-Action (VLA) model for embodied intelligence. Here’s how they built and used it:

  • Dataset Composition & Sources:
    They integrate six open-source datasets: OXE, OXE-AugE, Agibot-Beta, RoboCoin, RoboMind, and Galaxea. These collectively provide over 7 million raw trajectories, covering diverse robot embodiments, task types, and visual conditions. The final cleaned dataset contains over 6 million trajectories spanning 9,500+ hours and 20+ unique robot embodiments.

  • Key Subset Details:

    • OXE: Serves as the foundational single-arm dataset with broad task coverage.
    • OXE-AugE: Augments embodiment variation within single-arm setups; contributes 67% of total data.
    • Agibot-Beta & Galaxea: Provide high-quality visual observations and coherent action sequences; Agibot-Beta is downsampled to reduce embodiment bias.
    • RoboCoin & RoboMind: Prioritized for complex task planning and cross-embodiment generalization (dual-arm); contribute ~17.2% combined.
    • All datasets are cleaned and converted to LeRobot v2 format for consistency.
  • Cleaning & Filtering Rules:

    • Remove trajectories with empty, garbled, or non-English instructions; translate mixed-language prompts.
    • Discard visually degraded frames (black, blurred, occluded) or ineffective camera views.
    • Filter out abnormal trajectories (wrong length, large action deltas, mismatched frame rates).
    • Reject ambiguous or incomplete action annotations (e.g., missing dimensions, unclear rotation formats).
    • Resolve subtask-level instruction alignment by inserting frame-aligned granular guidance.
    • ~16% of trajectories are discarded during cleaning; the rest are refined and merged.
  • Standardization & Processing:

    • All actions standardized to 7D delta vectors per arm: [Δx, Δy, Δz, rotation vector, gripper]. Rotation vectors (axis-angle) replace Euler/quaternions for stability.
    • Single-arm data padded to dual-arm format (unused arm = zero), enabling unified training.
    • Model always outputs dual-arm actions but activates only relevant channels during execution.
    • Training uses multi-granularity uniform sampling to balance embodiment coverage and skill learning despite data imbalance.
  • Usage in Model Training:
    The cleaned, standardized UniACT-dataset is split into training, validation, and test sets. The model is trained end-to-end on this mixture, with sampling strategies ensuring balanced exposure across embodiments and task types. The dataset’s scale, quality, and diversity jointly support cross-embodiment generalization and precise vision-language-action alignment.
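The pad-to-dual-arm convention described above can be sketched as follows. This is a minimal illustration: the 7D channel ordering [Δx, Δy, Δz, rotation vector, gripper] follows the text, but the assumption that the active arm occupies the first seven channels of the 14D vector is ours.

```python
import numpy as np

def pad_to_dual_arm(action: np.ndarray) -> np.ndarray:
    """Pad a single-arm 7D delta action [dx, dy, dz, rx, ry, rz, gripper]
    to the unified 14D dual-arm format by zeroing the unused arm.
    Dual-arm actions pass through unchanged."""
    if action.shape[-1] == 14:       # already dual-arm
        return action
    assert action.shape[-1] == 7, "expected a 7D single-arm action"
    pad = np.zeros_like(action)      # inactive arm stays at zero
    return np.concatenate([action, pad], axis=-1)

single = np.array([0.01, -0.02, 0.0, 0.0, 0.0, 0.1, 1.0])
print(pad_to_dual_arm(single).shape)  # (14,)
```

During execution, only the channels belonging to the embodiment's real arm(s) would be read back out, matching the "activates only relevant channels" behavior described above.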

Method

The authors leverage a two-component architecture for the ABot-M0 model, designed to map multimodal perception directly to robot action generation. The framework separates vision-language understanding from action generation into a Visual Language Model (VLM) and an action expert, respectively. The VLM, implemented using Qwen3-VL, processes stacked multi-view image sequences—typically from front-facing, wrist-mounted, and top-down cameras—alongside natural language instructions. These modalities are independently tokenized and fused into a unified token sequence to enable cross-modal reasoning. The VLM outputs spatially aligned multimodal representations, which serve as contextual input to the action expert for action prediction.

Refer to the framework diagram, which illustrates the end-to-end pipeline from data preprocessing through two-stage training to action generation. The model ingests multi-view images and text, processes them through the pre-trained VLM, and routes the resulting features—optionally augmented with 3D spatial information—to the action expert. The action expert, built upon a Diffusion Transformer (DiT), predicts denoised action chunks directly rather than velocity or noise, aligning with the action manifold hypothesis. This design enables the model to focus on learning the intrinsic structure of meaningful action sequences, which are posited to lie on a low-dimensional manifold, rather than regressing high-dimensional, off-manifold targets.

The action expert operates under a flow matching paradigm but predicts the denoised action chunk $\hat{A}_t$ directly, given the noisy action $A_t^\tau$, robot state $q_t$, and contextual features $\phi_t$ from the VLM and optional 3D modules:

$$\hat{A}_t = V_\theta(\phi_t, A_t^\tau, q_t).$$

Although the model predicts actions, the training loss is computed on velocity to improve performance. The estimated and ground-truth velocities are derived as:

$$\hat{v} = \frac{\hat{A}_t - A_t^\tau}{1-\tau}, \qquad v = \frac{A_t - A_t^\tau}{1-\tau}.$$

The loss function is a weighted mean squared error on velocity, equivalent to a reweighted action loss:

$$\mathcal{L}(\theta) = \mathbb{E}\,\|v_{\mathrm{pred}} - v_{\mathrm{target}}\|^2 = \mathbb{E}\left[ w(\tau)\, \| V_\theta(\phi_t, A_t^\tau, q_t) - A_t \|^2 \right],$$

where $w(\tau) = \frac{1}{(1-\tau)^2}$. This weighting dynamically adjusts the learning-signal strength across noise levels, emphasizing fine-grained refinement as $\tau$ approaches 1.
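A minimal training-step sketch of this objective is given below. The linear interpolation path $A_t^\tau = (1-\tau)\,\epsilon + \tau\,A_t$ is an assumption consistent with the velocity definitions above; tensor shapes and the model interface are hypothetical.

```python
import torch

def aml_loss(model, phi_t, A_t, q_t):
    """One AML training step: the model predicts the clean action chunk,
    and the loss is MSE on the implied velocity, which is equivalent to a
    1/(1-tau)^2-weighted MSE on the action itself."""
    B = A_t.shape[0]
    tau = torch.rand(B, 1, 1) * 0.999            # noise level tau in [0, 1)
    noise = torch.randn_like(A_t)
    A_tau = (1 - tau) * noise + tau * A_t        # assumed interpolation path
    A_hat = model(phi_t, A_tau, q_t)             # direct clean-action prediction
    v_pred = (A_hat - A_tau) / (1 - tau)         # implied predicted velocity
    v_target = (A_t - A_tau) / (1 - tau)         # ground-truth velocity
    return ((v_pred - v_target) ** 2).mean()
```

Note that a model which recovers the clean chunk exactly drives the velocity loss to zero, which is the sense in which the two formulations are equivalent.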

During inference, the model follows an ODE-based trajectory to generate actions. Starting from pure noise $A_t^0 \sim \mathcal{N}(0, \mathbf{I})$, it iteratively denoises the action using the predicted velocity $\hat{v}$ and updates the state via numerical integration:

$$A_t^{\tau + \Delta\tau} = A_t^\tau + \Delta\tau \cdot \hat{v}.$$

This approach retains the smooth trajectory generation of flow models while enabling direct action prediction at the model level.
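The inference procedure above amounts to Euler integration of the learned flow. A sketch follows; the step count and the model interface are illustrative assumptions.

```python
import torch

@torch.no_grad()
def generate_actions(model, phi_t, q_t, chunk_shape, steps=10):
    """Euler integration from pure noise to a clean action chunk.
    The model predicts the clean chunk directly; the velocity used for
    each update is derived from that prediction."""
    A = torch.randn(chunk_shape)             # A_t^0 ~ N(0, I)
    dtau = 1.0 / steps
    tau = 0.0
    for _ in range(steps):
        A_hat = model(phi_t, A, q_t)         # direct clean-action prediction
        v_hat = (A_hat - A) / (1.0 - tau)    # implied velocity
        A = A + dtau * v_hat                 # Euler step
        tau += dtau
    return A
```

With a fixed prediction, each Euler step moves the iterate a fraction $\Delta\tau/(1-\tau)$ of the way toward it, so the trajectory lands exactly on the predicted chunk at $\tau = 1$.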

The training process follows a two-stage paradigm. Stage 1 performs large-scale pre-training on the UniACT dataset, which contains over 6 million trajectories across 20+ embodiments. Actions are represented as delta actions in the end-effector frame, with single-arm actions in $\mathbb{R}^7$ and dual-arm actions in $\mathbb{R}^{14}$. A pad-to-dual-arm strategy ensures parameter sharing across embodiments, and a dual-weighted sampling strategy balances task and embodiment distributions to mitigate long-tail bias.

Stage 2 introduces supervised fine-tuning (SFT) to inject 3D spatial priors for high-precision tasks. The VLM and action expert are jointly fine-tuned with a small learning rate, dropout, and action noise perturbation to enhance robustness. This stage preserves generalization while improving performance on tasks requiring metric spatial reasoning, such as insertion or bimanual coordination.

To enhance spatial reasoning, the authors introduce a modular 3D information injection module that operates alongside the VLM. It integrates feedforward single-image 3D features (via VGGT) and implicit multi-view features (via Qwen-Image-Edit) to enrich the VLM’s semantic features with geometric priors. As shown in the figure below, these 3D features are fused with the final-layer VLM features via cross-attention before being passed to the action expert. This fusion strategy harmonizes semantic and geometric streams, enabling precise and spatially grounded action generation.
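The cross-attention fusion of semantic and geometric streams can be sketched as a small module in which final-layer VLM tokens attend to 3D feature tokens. This is a hedged sketch: the class name, dimensions, and the residual-plus-norm arrangement are our assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GeometricFusion(nn.Module):
    """Sketch of cross-attention fusion: VLM tokens (queries) attend to
    3D feature tokens (keys/values) from modules such as VGGT."""
    def __init__(self, dim: int = 1024, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vlm_tokens: torch.Tensor, geo_tokens: torch.Tensor) -> torch.Tensor:
        # Queries are semantic VLM features; keys/values are geometric features.
        fused, _ = self.attn(vlm_tokens, geo_tokens, geo_tokens)
        # Residual connection keeps the original semantics intact.
        return self.norm(vlm_tokens + fused)
```

Because the geometric stream enters only through this fusion step, the VLM backbone itself stays unchanged, which is what makes the 3D modules plug-and-play.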

The action manifold hypothesis underpins the design of the action expert. As illustrated in the figure below, conventional velocity prediction ($v$-pred) can lead to unreasonable actions by projecting off the low-dimensional action manifold. In contrast, direct action prediction ($a$-pred) constrains the model to learn meaningful, coherent action sequences that lie on the manifold, reducing learning burden and improving action quality.

Experiment

  • Task-uniform sampling outperforms trajectory- and embodiment-uniform strategies by balancing embodiment coverage and skill diversity, leading to stronger cross-embodiment, cross-dataset, and downstream task generalization.
  • Pretraining with task-uniform sampling reduces redundancy and enhances exposure to rare skills and embodiments, improving overall model robustness without requiring strict embodiment-level balancing.
  • Final-layer VLM features, pretrained on robotics data, are most effective for action prediction—outperforming intermediate layers, multi-layer concatenation, and action-query augmentation, indicating sufficient internal alignment with action semantics.
  • Action Manifold Learning (AML), which directly predicts actions instead of noise, consistently surpasses noise-prediction paradigms across varying denoising steps and action chunk sizes, especially under high-dimensional or long-horizon conditions.
  • Injecting 3D spatial features via cross-attention improves performance on both LIBERO and LIBERO-Plus, with multi-view synthesis further enhancing robustness to camera viewpoint perturbations.
  • ABot-M0 achieves state-of-the-art results across multiple benchmarks (LIBERO, LIBERO-Plus, RoboCasa, RoboTwin2.0), demonstrating strong generalization, robustness to perturbations, and scalability to complex, high-dimensional manipulation tasks.

The authors use a task-uniform sampling strategy during pretraining to balance embodiment diversity and skill coverage, which leads to stronger cross-embodiment and cross-dataset generalization compared to trajectory- or embodiment-uniform approaches. Results show that this strategy reduces redundancy in skill sampling while improving exposure to rare embodiments, translating into better downstream performance across multiple benchmarks. The ABot-M0 model, built on this foundation, consistently outperforms prior methods in both single- and multi-arm manipulation tasks, demonstrating superior generalization and robustness.

The authors use ABot-M0 to evaluate how varying action chunk sizes and denoising steps affect performance under perturbations in the LIBERO-Plus benchmark. Results show that ABot-M0 consistently outperforms Qwen3-VL-GR00T across most perturbation dimensions, especially at smaller chunk sizes, and maintains robustness even as chunk size increases, while GR00T’s performance degrades sharply. This supports the effectiveness of direct action prediction over noise prediction in handling high-dimensional action spaces under varying conditions.

The authors evaluate how different VLM feature layers and query mechanisms affect action prediction performance after robotics pretraining. Results show that using features from the final layer without action queries consistently yields the highest success rates, indicating that deep representations already encode sufficient action-relevant semantics. Incorporating intermediate layers or action queries does not improve performance and may even degrade it, suggesting that the pre-trained model’s internal structure is optimally aligned with the action space without additional adaptation.

The authors use three sampling strategies—Trajectory-Uniform, Embodiment-Uniform, and Task-Uniform—to pretrain models on multi-embodiment robotic datasets and evaluate downstream performance on LIBERO-Plus. Results show that Task-Uniform sampling achieves the highest overall success rate, indicating it better balances embodiment diversity and skill coverage during training. This strategy consistently outperforms the others in cross-embodiment generalization and downstream task adaptation.
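The task-uniform strategy can be sketched as a two-level draw: pick a task uniformly at random, then a trajectory within it. The dictionary field names below are hypothetical; the paper does not specify its data schema.

```python
import random
from collections import defaultdict

def task_uniform_sample(trajectories, n):
    """Draw n trajectories by first sampling a task uniformly, then a
    trajectory within that task. Over-represented tasks contribute no
    more samples than rare ones, balancing skill coverage."""
    by_task = defaultdict(list)
    for traj in trajectories:
        by_task[traj["task"]].append(traj)
    tasks = list(by_task)
    return [random.choice(by_task[random.choice(tasks)]) for _ in range(n)]
```

Trajectory-uniform sampling would instead draw directly from the flat pool, so tasks with many trajectories dominate; embodiment-uniform sampling would group by robot embodiment at the first level rather than by task.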

The authors use a unified model trained across multiple suites to evaluate performance on the LIBERO benchmark, achieving state-of-the-art success rates across all task categories. Results show consistent superiority over prior methods, particularly in long-horizon and goal-conditioned tasks, indicating strong spatial reasoning and multi-step planning capabilities. The model’s high average performance reflects robust generalization across diverse manipulation scenarios without task-specific tuning.

