StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency, grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current Video-LLM-based VLN methods often face trade-offs among fine-grained visual understanding, long-term context modeling, and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language, and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. The project page is: https://streamvln.github.io/.
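To make the slow-fast design concrete, the following is a minimal Python sketch of a bounded context manager with a sliding window of active turns (fast) and a compressed memory of evicted history (slow). All names, parameters, and the uniform-subsampling stand-in for the paper's 3D-aware token pruning are illustrative assumptions, not the StreamVLN implementation.

```python
from collections import deque


class SlowFastContext:
    """Illustrative slow-fast context manager (hypothetical names;
    not from the StreamVLN codebase)."""

    def __init__(self, window_size=8, memory_budget=256):
        # Fast context: sliding window over the most recent dialogue turns.
        self.fast_window = deque(maxlen=window_size)
        # Slow context: compressed tokens from evicted turns, bounded in size.
        self.slow_memory = []
        self.memory_budget = memory_budget

    def add_turn(self, turn_tokens):
        # When the window is full, the oldest turn is compressed into slow
        # memory before the append below evicts it from the deque.
        if len(self.fast_window) == self.fast_window.maxlen:
            evicted = self.fast_window[0]
            self.slow_memory.extend(self._prune(evicted))
            # Keep the slow memory within a fixed token budget.
            self.slow_memory = self.slow_memory[-self.memory_budget:]
        self.fast_window.append(turn_tokens)

    def _prune(self, tokens, keep_ratio=0.25):
        # Placeholder compression: keep a uniform subset of tokens. The
        # paper's 3D-aware pruning instead exploits spatial redundancy
        # across views when selecting which visual tokens to retain.
        step = max(1, round(1 / keep_ratio))
        return tokens[::step]

    def context(self):
        # Bounded context fed to the model: slow memory plus active window.
        active = [tok for turn in self.fast_window for tok in turn]
        return self.slow_memory + active
```

Because both components have fixed capacity, the total context length, and hence per-step inference cost, stays bounded regardless of stream length, which is the property the abstract attributes to the slow-fast design; the unchanged slow memory and window prefix are also what makes KV cache reuse across turns possible.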