Towards Learning a Generalist Model for Embodied Navigation

Building a generalist agent that can interact with the world is an intriguing target of AI systems, spurring research on embodied navigation, where an agent is required to navigate according to instructions or respond to queries. Despite the major progress attained, previous works primarily focus on task-specific agents, which lack generalizability to unseen scenarios. Recently, LLMs have presented remarkable capabilities across various fields and provided a promising opportunity for embodied navigation. Drawing on this, we propose NaviLLM, the first generalist model for embodied navigation. It adapts LLMs to embodied navigation by introducing schema-based instruction, which flexibly casts various tasks into generation problems, thereby unifying a wide range of tasks. This approach allows us to integrate data from diverse datasets into training, equipping NaviLLM with the wide range of capabilities required for embodied navigation. We conduct extensive experiments to evaluate the performance and generalizability of our model. The experimental results demonstrate that our unified model achieves state-of-the-art performance on CVDN, SOON, and ScanQA. Specifically, it surpasses the previous state-of-the-art method by a significant margin of 29% in goal progress on CVDN. Moreover, our model also demonstrates strong generalizability and presents impressive results on unseen tasks, e.g., embodied question answering and 3D captioning.
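
To make the idea of "casting various tasks into generation problems" concrete, here is a minimal Python sketch of what a schema-based instruction could look like. It is not the authors' code: the schema fields (task, observation, history, output_hint) and the example tokens are illustrative assumptions, showing only how heterogeneous tasks might reduce to a single prompt-then-generate interface for an LLM.

```python
# A minimal sketch (not NaviLLM's actual schema) of casting heterogeneous
# embodied tasks into one text-generation format. Field names are assumptions.
from dataclasses import dataclass


@dataclass
class SchemaInstruction:
    task: str          # e.g. "navigation", "question answering", "captioning"
    observation: str   # textual/tokenized summary of visual observations
    history: str       # previous actions or dialog turns
    output_hint: str   # what the LLM should generate next

    def to_prompt(self) -> str:
        # Every task reduces to one prompt; the LLM's generation is the answer.
        return (
            f"Task: {self.task}\n"
            f"Observation: {self.observation}\n"
            f"History: {self.history}\n"
            f"{self.output_hint}"
        )


# Two different tasks, one unified generation interface.
nav = SchemaInstruction(
    task="navigation",
    observation="candidate viewpoints: <img_0> <img_1> <img_2>",
    history="walked past the kitchen",
    output_hint="Select the next viewpoint:",
)
qa = SchemaInstruction(
    task="question answering",
    observation="scene features: <scene_tokens>",
    history="Q: What color is the sofa?",
    output_hint="Answer:",
)
print(nav.to_prompt())
print(qa.to_prompt())
```

Under this framing, training on multiple datasets amounts to mixing such prompts into one generation corpus, which is what allows a single model to absorb capabilities from all of them.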