Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning

Modern robot navigation systems encounter difficulties in diverse and complex indoor environments. Traditional approaches rely on multiple modules with small models or rule-based systems and thus lack adaptability to new environments. To address this, we developed Astra, a comprehensive dual-model architecture, Astra-Global and Astra-Local, for mobile robot navigation. Astra-Global, a multimodal LLM, processes vision and language inputs to perform self and goal localization using a hybrid topological-semantic graph as the global map, and outperforms traditional visual place recognition methods. Astra-Local, a multitask network, handles local path planning and odometry estimation. Its 4D spatial-temporal encoder, trained through self-supervised learning, generates robust 4D features for downstream tasks. The planning head utilizes flow matching and a novel masked ESDF loss to minimize collision risks when generating local trajectories, and the odometry head integrates multi-sensor inputs via a transformer encoder to predict the relative pose of the robot. Deployed on real in-house mobile robots, Astra achieves a high end-to-end mission success rate across diverse indoor environments.
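
To make the planning-head objectives more concrete, the following is a minimal PyTorch-style sketch of a conditional flow-matching loss paired with a masked ESDF collision penalty. The names (FlowMatchingPlanner, masked_esdf_loss), network sizes, the safety margin, and the assumption that ESDF values are pre-sampled at the predicted waypoints are illustrative guesses under a standard flow-matching formulation, not the paper's implementation.

# Hypothetical sketch, not the authors' code: flow-matching trajectory head
# with a masked ESDF collision penalty, assuming a PyTorch-style setup.
import torch
import torch.nn as nn


class FlowMatchingPlanner(nn.Module):
    """Velocity-field predictor over local trajectories, conditioned on 4D features."""

    def __init__(self, feat_dim=256, horizon=16, waypoint_dim=2):
        super().__init__()
        self.horizon = horizon
        self.waypoint_dim = waypoint_dim
        self.net = nn.Sequential(
            nn.Linear(feat_dim + horizon * waypoint_dim + 1, 512),
            nn.ReLU(),
            nn.Linear(512, horizon * waypoint_dim),
        )

    def forward(self, feat, traj_t, t):
        # feat: (B, feat_dim) pooled 4D features, traj_t: (B, H, D), t: (B, 1)
        x = torch.cat([feat, traj_t.flatten(1), t], dim=-1)
        return self.net(x).view(-1, self.horizon, self.waypoint_dim)


def flow_matching_loss(model, feat, traj_gt):
    """Conditional flow matching: regress the straight-line velocity that carries
    a noise sample toward the expert trajectory."""
    noise = torch.randn_like(traj_gt)
    t = torch.rand(traj_gt.size(0), 1, device=traj_gt.device)
    alpha = t.unsqueeze(-1)                            # broadcast over (B, H, D)
    traj_t = (1 - alpha) * noise + alpha * traj_gt     # interpolate noise -> data
    target_vel = traj_gt - noise
    pred_vel = model(feat, traj_t, t)
    return ((pred_vel - target_vel) ** 2).mean()


def masked_esdf_loss(esdf_at_waypoints, valid_mask, margin=0.3):
    """Penalize waypoints whose distance-to-obstacle clearance falls below a
    safety margin, counting only waypoints in observed (valid) map cells."""
    penalty = torch.relu(margin - esdf_at_waypoints)   # > 0 when too close to obstacles
    return (penalty * valid_mask).sum() / valid_mask.sum().clamp(min=1)

At inference time, a flow-matching head of this form would integrate the learned velocity field (e.g., a few Euler steps) from a noise sample to produce a local trajectory, with the ESDF term acting only as a training-time regularizer in this sketch.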