
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, Jianwei Yang

Abstract

Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective in handling complex tasks, such as manipulation. In this work, we introduce visual trace prompting, a simple yet effective approach to facilitate VLA models' spatial-temporal awareness for action prediction by encoding state-action trajectories visually. We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories using visual trace prompting. Evaluations of TraceVLA across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and 3.5x on real-robot tasks, and exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact VLA model based on 4B Phi-3-Vision, pretrained on the Open-X-Embodiment dataset and finetuned on our dataset, which rivals the 7B OpenVLA baseline while significantly improving inference efficiency.
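
The abstract describes visual trace prompting as encoding state-action trajectories visually, i.e., rendering the recent motion of tracked points onto the observation image before the policy sees it. The paper defines the exact pipeline; as an illustration only, the following minimal sketch (a hypothetical `overlay_visual_trace` helper using OpenCV, with assumed `tracker` and `vla_model` objects) shows one plausible way a past 2D trajectory could be drawn onto the current frame as a visual prompt.

```python
import numpy as np
import cv2


def overlay_visual_trace(frame: np.ndarray,
                         trace_points: np.ndarray,
                         color: tuple = (0, 0, 255),
                         thickness: int = 2) -> np.ndarray:
    """Draw a 2D trace of past tracked-point positions onto the current frame.

    frame:        H x W x 3 uint8 BGR image (the robot's camera observation).
    trace_points: T x 2 array of (x, y) pixel coordinates, ordered oldest first.
    Returns a copy of the frame with the trace drawn as connected line segments.
    """
    annotated = frame.copy()
    pts = trace_points.astype(int)
    # Connect consecutive positions so the model can see the motion path.
    for start, end in zip(pts[:-1], pts[1:]):
        cv2.line(annotated, tuple(start), tuple(end), color, thickness)
    # Mark the most recent position to indicate the direction of motion.
    cv2.circle(annotated, tuple(pts[-1]), radius=4, color=color, thickness=-1)
    return annotated


# Hypothetical usage: the trace-annotated image is supplied to the VLA policy
# alongside the language instruction (object names below are assumptions).
# frame = camera.read()                        # current RGB observation
# trace = tracker.get_trace(history)           # T x 2 pixel trajectory of tracked points
# prompt_image = overlay_visual_trace(frame, trace)
# action = vla_model.predict(prompt_image, instruction="put the carrot on the plate")
```

This is a sketch of the general idea rather than the paper's implementation; TraceVLA's actual prompting, point selection, and model inputs follow the procedure described in the paper.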

