Guava: An Effective and Universal Framework for Embodied Manipulation

Haowen Liu Xirui Li Shaoxiong Yao Peng Shi Tianyi Zhou Jia-Bin Huang Furong Huang Jiayuan Mao

Abstract

Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what constitutes an effective harness for embodied manipulation and to what extent such a harness can unlock embodied capabilities across a broad range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use, developed through a systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key components for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To investigate whether these design principles also hold universally for small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected exclusively in simulation. Experimental results in both simulation and real-world environments show performance comparable to state-of-the-art proprietary models, along with strong generalization to unseen objects, novel instructions, and long-horizon tasks. The results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.

One-sentence Summary

Guava is a harness framework for embodied manipulation that leverages iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations to distill capabilities into a 4B open-source model using fewer than 2K simulation trajectories, achieving performance comparable to frontier proprietary models with strong generalization to unseen objects, novel instructions, and long-horizon tasks across both simulation and real-world environments.

Key Contributions

  • This work introduces Guava, a modular harness framework for embodied tool use that systematically explores agent workflows, action spaces, and observation spaces to bridge high-level reasoning with external perception and control modules.
  • The study identifies three core design principles for effective manipulation: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations, which enable explicit plan inspection and continuous failure recovery.
  • A data-efficient training pipeline distills these capabilities into a 4B open-source model using fewer than 2,000 simulation trajectories, achieving performance comparable to frontier proprietary models while generalizing to unseen objects, novel instructions, and long-horizon tasks in both simulation and real-world environments.

Introduction

Large vision-language models offer a promising foundation for embodied manipulation, yet end-to-end vision-language-action policies face significant hurdles regarding data efficiency and scalability across diverse environments. Existing harness-based methods often depend on one-shot program generation or specialized pipelines, which restricts robust long-horizon planning and failure recovery while requiring costly frontier models. The authors leverage these observations to develop Guava, a universal harness framework that optimizes embodied tool use through iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. By integrating this architecture with a data-efficient training pipeline, they show that compact open-source models can match frontier performance, demonstrating strong generalization and real-world transfer using fewer than 2,000 simulation trajectories.

Dataset

  • Dataset Composition and Sources: The authors construct the dataset by deploying the Guava harness framework with GPT-5.4 in the RoboSuite simulation environment. A standardized API enables closed-loop interaction by exposing scene observations, action execution, and episode-level feedback to the model.

  • Subset Details:

    • Success Trajectories: Account for 1,191 episodes (62 percent of the dataset) generated from 237 unique task prompts. The authors randomize environmental parameters like pose, lighting, and camera views during generation to boost diversity.
    • Recovery Trajectories: Comprise the remaining 743 episodes (38 percent). These are created by injecting predefined simulation errors such as missed grasps, dropped objects, or misalignments into successful rollouts, or by sampling random intermediate states to force the model to recover.
  • Data Usage and Mixture Ratios: The authors fine-tune Guava-Agent-4B using the complete curated set. They structure the training mixture with a 62 percent success to 38 percent recovery ratio to balance baseline performance with error recovery capabilities.

  • Processing and Quality Control: The authors apply a multi-stage curation pipeline to ensure quality and reduce bias. Initial filtering automatically removes episodes with invalid tool parameters or poor simulation initialization. A manual review step eliminates low-quality samples containing unrelated dialogue or excessive self-reflection. Finally, the authors deduplicate highly similar trajectories to prevent overfitting to specific prompts or execution patterns and apply the same filtering rules to the recovery data.
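The subset counts and the automatic parts of the curation pipeline can be illustrated with a minimal Python sketch. The `Episode` record, its fields, and the exact-match deduplication key are assumptions made for illustration: the source does not describe the data schema or how "highly similar" trajectories are detected, and the manual review step is not modeled here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Episode:
    # Hypothetical record layout; the source does not specify a schema.
    prompt: str
    tool_calls: Tuple[str, ...]
    kind: str            # "success" or "recovery"
    valid: bool = True   # False for invalid tool parameters or bad sim initialization


def curate(episodes: List[Episode]) -> List[Episode]:
    """Drop invalid episodes, then keep one episode per (prompt, tool_calls) pair.

    Exact-match deduplication is a stand-in for the similarity-based
    deduplication described in the text.
    """
    seen = set()
    kept = []
    for ep in episodes:
        if not ep.valid:
            continue
        key = (ep.prompt, ep.tool_calls)
        if key in seen:
            continue
        seen.add(key)
        kept.append(ep)
    return kept


def mixture(n_success: int, n_recovery: int) -> Tuple[int, int, int]:
    """Return (total episodes, success percent, recovery percent), rounded."""
    total = n_success + n_recovery
    return total, round(100 * n_success / total), round(100 * n_recovery / total)


# Counts reported in the text: 1,191 success and 743 recovery episodes.
print(mixture(1191, 743))
```

With the reported counts, the total is 1,934 episodes, which is consistent with the "fewer than 2,000 trajectories" figure and with the 62/38 split.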

Method

The authors introduce Guava, a harness framework that transforms embodied manipulation from an open-loop prediction problem into a grounded, closed-loop interaction process. This framework enables robust performance by integrating iterative reasoning, semantic action abstractions, and multimodal observations. The system operates through perception-reasoning-action loops where the model continuously updates its plan based on new observations, allowing it to recover from grasp failures and state deviations.
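The closed-loop idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_episode`, `ToyEnv`, and `toy_policy` are hypothetical names, the toy environment stands in for a simulator, and the hard-coded policy stands in for the VLM. None of this is the authors' implementation.

```python
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, bool]]


def run_episode(
    policy: Callable[[Dict[str, bool], History], str],
    observe: Callable[[], Dict[str, bool]],
    execute: Callable[[str], bool],
    max_steps: int = 10,
) -> Tuple[bool, History]:
    """Closed loop: observe, let the policy pick one action, execute it, repeat."""
    history: History = []
    for _ in range(max_steps):
        obs = observe()                 # perception
        action = policy(obs, history)   # reasoning
        if action == "done":
            return True, history
        ok = execute(action)            # action
        history.append((action, ok))
    return False, history


class ToyEnv:
    """Hypothetical stand-in for a simulator: the first grasp attempt always misses."""

    def __init__(self) -> None:
        self.holding = False
        self.attempts = 0

    def observe(self) -> Dict[str, bool]:
        return {"holding": self.holding}

    def step(self, action: str) -> bool:
        if action == "grasp":
            self.attempts += 1
            self.holding = self.attempts >= 2
            return self.holding
        return False


def toy_policy(obs: Dict[str, bool], history: History) -> str:
    # Re-plan from the latest observation: retry the grasp until the object is held.
    return "done" if obs["holding"] else "grasp"


env = ToyEnv()
success, history = run_episode(toy_policy, env.observe, env.step)
print(success, history)
```

Because the policy is queried again after every executed action, the missed first grasp is visible in the next observation and is simply retried. A single-turn plan generated up front would have no opportunity to notice the failure.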

A critical component of the design is the semantic action space, which delegates low-level geometric and physical reasoning to lower-level controllers. Rather than outputting raw joint coordinates, the VLM issues task-oriented commands using a set of defined tools. These tools include high-level actions such as grasp(object) and release(), as well as positioning primitives like align(object, position, clearance). The align function accepts a position parameter from the set {top, left, right, front, back} and a clearance parameter from the set {small, medium, large}, allowing the agent to reason about object relationships without managing precise 3D coordinates directly. This abstraction significantly improves performance compared to low-level interfaces that require explicit geometric planning.
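One way to picture such a semantic action space is as a small set of validated tool calls. The sketch below is an assumption-laden illustration: the tool names and the allowed position and clearance values come from the text, but the dictionary command format and the Python function signatures are hypothetical and do not reflect the actual Guava API.

```python
from enum import Enum
from typing import Dict


class Position(Enum):
    TOP = "top"
    LEFT = "left"
    RIGHT = "right"
    FRONT = "front"
    BACK = "back"


class Clearance(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"


def grasp(obj: str) -> Dict[str, str]:
    """Build a grasp command for the named object."""
    return {"tool": "grasp", "object": obj}


def release() -> Dict[str, str]:
    """Build a release command."""
    return {"tool": "release"}


def align(obj: str, position: str, clearance: str) -> Dict[str, str]:
    """Build an align command, rejecting values outside the allowed sets."""
    # Enum lookup by value raises ValueError for any unknown position or clearance.
    return {
        "tool": "align",
        "object": obj,
        "position": Position(position).value,
        "clearance": Clearance(clearance).value,
    }


print(align("cube", "top", "small"))
```

The model only ever chooses among a handful of discrete symbols. Converting a value such as "top" with "small" clearance into a concrete end-effector pose is left to the lower-level controller, which is not modeled here.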

To transfer these embodied capabilities to a compact open-source model, the authors develop a data-efficient training pipeline that distills behaviors from frontier VLMs. This process begins with a data generation engine that collects interaction trajectories in a simulation environment. The engine leverages scene randomization and targeted error perturbations to generate diverse examples, including not only successful completions but also recovery trajectories where the model learns to correct execution failures.
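The targeted error perturbation might look roughly like the following Python sketch, which turns a successful tool-call sequence into a starting point for a recovery episode. Everything here is hypothetical: the function name, the dictionary layout, and the uniform sampling of the cut point and error type are illustrative assumptions, since the source does not describe how perturbations are scheduled.

```python
import random
from typing import Dict, List, Union

# Error categories named in the text.
ERROR_TYPES = ("missed_grasp", "dropped_object", "misalignment")


def make_recovery_seed(
    steps: List[str], rng: random.Random
) -> Dict[str, Union[List[str], str]]:
    """Split a successful rollout at a random step and attach an injected error.

    The teacher model would then be asked to continue from the perturbed state.
    """
    if len(steps) < 2:
        raise ValueError("need at least two steps to split a rollout")
    # Keep at least one executed step and leave at least one step remaining.
    cut = rng.randrange(1, len(steps))
    return {
        "executed": list(steps[:cut]),
        "injected_error": rng.choice(ERROR_TYPES),
        "remaining": list(steps[cut:]),
    }


rng = random.Random(0)
seed = make_recovery_seed(["align", "grasp", "align", "release"], rng)
print(seed)
```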

The training pipeline employs a two-stage approach to optimize the policy. First, supervised fine-tuning is performed on the collected dataset, which combines successful demonstrations with recovery scenarios to teach both manipulation skills and error correction. Following this, Group Relative Policy Optimization (GRPO) is applied using a sparse task-success reward. This reinforcement learning stage is strategically focused on the most challenging long-horizon tasks to improve sequential planning and adaptation without incurring excessive computational costs across simpler tasks.
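To make the role of the sparse reward concrete, the sketch below computes group-relative advantages in the way GRPO is commonly described: rewards for several rollouts of the same task are normalized by the group mean and standard deviation. The use of the population standard deviation and the `eps` term are assumptions; the source gives no details of the authors' GRPO configuration, and the policy-gradient update itself is omitted.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout in the group gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task with a sparse 0/1 task-success reward.
print(group_relative_advantages([1, 0, 0, 1]))
# A group where every rollout fails carries no learning signal.
print(group_relative_advantages([0, 0, 0, 0]))
```

With a binary success reward, groups in which all rollouts succeed or all fail yield zero advantages, so only tasks with mixed outcomes contribute gradient. This is one plausible reading of why the reinforcement learning stage concentrates on the most challenging long-horizon tasks.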

Experiment

The evaluation tests a distilled 4B-parameter VLM on diverse in-distribution and out-of-distribution long-horizon manipulation tasks, both in RoboSuite simulation and on a physical Franka robot arm. The experiments show that this compact model matches frontier proprietary systems in real-world deployment, demonstrating that embodied tool-use behaviors can be transferred effectively from minimal simulation data. Further ablations confirm that reinforcement learning post-training and continuous closed-loop execution are critical for robust long-horizon reasoning, error recovery, and emergent state awareness, while also exposing persistent limitations in precise spatial understanding. Overall, the results indicate that agentic planning bridges simulation and reality by decoupling high-level semantic reasoning from low-level control.

The chart compares token consumption per episode for GPT-5.4 and Guava-Agent-4B across a variety of manipulation tasks. Guava-Agent-4B uses fewer tokens than the GPT-5.4 baseline in the majority of individual tasks, and the difference is most pronounced in the overall average. The gap persists across task complexities, with minor variations in specific long-horizon scenarios, indicating that the compact model achieves comparable behaviors with substantially reduced computational overhead.

The experiment evaluates the Guava harness with different base models across various manipulation tasks. Larger frontier models maintain high success rates in both in-distribution and out-of-distribution scenarios and handle diverse task categories, including complex long-horizon sequences. The smaller model fails consistently across all task types because of limitations in instruction following, tool calling, and tool selection. Gemini-3.1-Pro generally achieves the highest overall success rate among the evaluated models.

The experiments evaluate the design choices of the Guava framework for embodied manipulation. Closed-loop ReAct execution significantly outperforms single-turn planning in overall success rate. High-level toolsets improve performance on tasks such as picking up oranges and removing cubes, while low-level tools are more effective for specific actions such as pushing. Visual input is also critical: image-only inputs achieve superior results on certain spatial tasks compared to multimodal text-and-image inputs.

The experiment evaluates Guava-Agent-4B against several proprietary and open-source baselines across a suite of embodied manipulation tasks. Guava-Agent-4B achieves the highest overall success rate, generally outperforming the proprietary GPT-5.4 model and the concurrent CaP-Agent0 across task categories. It also generalizes to out-of-distribution scenarios, maintaining high success rates on tasks involving unseen objects and novel instructions. The untrained base Qwen3.5-4B model performs significantly worse, which underscores the impact of the proposed training process.

The experiments evaluate the Guava framework and its distilled agent across diverse embodied manipulation tasks, validating the effectiveness of the training harness, planning architecture, tool selection strategies, and computational efficiency. Qualitative assessments demonstrate that closed-loop ReAct planning consistently outperforms single-turn approaches, while specialized high- and low-level toolsets effectively optimize complex object manipulation and precise physical interactions. Additionally, the distilled model achieves robust generalization to unseen scenarios and substantially reduces computational overhead compared to larger proprietary baselines, with image-only inputs proving highly effective for spatial reasoning. Overall, the findings confirm that the proposed framework enables compact models to deliver superior task success and efficiency without relying on massive frontier architectures.

