
LongFly: Long-Horizon Vision-and-Language Navigation for UAVs with Spatio-Temporal Context Integration

Wen Jiang, Li Wang, Kangyao Huang, Wei Fan, Jinyuan Liu, Shaoyu Liu, Hongwei Duan, Bin Xu, Xiangyang Ji

Abstract

Unmanned aerial vehicles (UAVs) are essential tools for post-disaster search and rescue operations, where they face challenges such as high information density, rapid viewpoint changes, and dynamic structures, particularly in long-horizon navigation missions. However, current UAV vision-and-language navigation (VLN) methods struggle to model long-term spatiotemporal context in complex environments, leading to inaccurate semantic alignment and unstable path planning. To this end, we propose LongFly, a spatiotemporal context modeling framework for long-horizon UAV VLN. LongFly introduces a history-aware spatiotemporal modeling strategy that transforms fragmented, redundant historical data into structured, compact, and expressive representations. First, we propose a slot-based historical image compression module that dynamically distills multi-view historical observations into fixed-length contextual representations. Second, we introduce a spatiotemporal trajectory encoding module designed to capture the temporal dynamics and spatial structure of UAV trajectories. Finally, to integrate the accumulated spatiotemporal context with current observations, we design a prompt-guided multimodal integration module that enables robust temporal reasoning and reliable waypoint prediction. Experimental results show that LongFly outperforms state-of-the-art UAV VLN methods, improving success rate by 7.89% and success weighted by path length by 6.33%, in both seen and unseen environments.

One-sentence Summary

The authors, whose work is supported by Chinese funding programs including the National Natural Science Foundation of China, the Chongqing Natural Science Foundation, and the National High Technology Research and Development Program, propose LongFly, a spatiotemporal context modeling framework for long-horizon UAV vision-and-language navigation that integrates history-aware visual compression, trajectory encoding, and prompt-guided multimodal fusion. By dynamically distilling multi-view historical observations into compact semantic slots and aligning them with language instructions through a structured prompt, LongFly enables robust, time-aware waypoint prediction in complex 3D environments, achieving a 7.89% higher success rate and 6.33% better success weighted by path length than state-of-the-art methods across seen and unseen scenarios.

Key Contributions

  • LongFly addresses the challenge of long-horizon UAV vision-and-language navigation in complex, dynamic environments by introducing a unified spatiotemporal context modeling framework that enables stable, globally consistent decision-making despite rapid viewpoint changes and high information density.

  • The method features a slot-based historical image compression module that dynamically distills multi-view past observations into compact, fixed-length representations, and a spatiotemporal trajectory encoding module that captures both temporal dynamics and spatial structure of UAV flight paths.

  • Experimental results show LongFly achieves 7.89% higher success rate and 6.33% higher success weighted by path length than state-of-the-art baselines across both seen and unseen environments, demonstrating robust performance in long-horizon navigation tasks.

Introduction

The authors address long-horizon vision-and-language navigation (VLN) for unmanned aerial vehicles (UAVs), a critical capability for post-disaster search and rescue, environmental monitoring, and geospatial data collection in complex, GPS-denied environments. While prior UAV VLN methods have made progress in short-range tasks, they struggle with long-horizon navigation due to fragmented, static modeling of historical visual and trajectory data, leading to poor semantic alignment and unstable path planning. Existing approaches often treat history as isolated memory cues without integrating them into a unified spatiotemporal context aligned with language instructions and navigation dynamics. To overcome this, the authors propose LongFly, a spatiotemporal context modeling framework that dynamically compresses multi-view historical images into compact, instruction-relevant representations via a slot-based compression module, encodes trajectory dynamics through a spatiotemporal trajectory encoder, and fuses multimodal context with current observations using a prompt-guided integration module. This enables robust, time-aware reasoning and consistent waypoint prediction across long sequences, achieving 7.89% higher success rate and 6.33% better success weighted by path length than state-of-the-art baselines in both seen and unseen environments.

Method

The authors leverage a spatiotemporal context modeling framework named LongFly to address the challenges of long-horizon UAV visual-language navigation (VLN). The overall architecture integrates three key modules to transform fragmented historical data into structured, compact representations that support robust waypoint prediction. The framework begins by processing the current command instruction and the UAV's current visual observation, which are tokenized and projected into a shared latent space. Concurrently, historical multi-view images and waypoint trajectories are processed through dedicated modules to generate compressed visual and motion representations.

The first module, Slot-based Historical Image Compression (SHIC), addresses the challenge of efficiently storing and retrieving long-horizon visual information. It processes the sequence of historical multi-view images $R_1, R_2, \ldots, R_{t-1}$ using a CLIP-based visual encoder $\mathcal{F}_v$ to extract visual tokens $Z_i$ at each time step. These tokens are then used to update a fixed-capacity set of learnable visual memory slots $S_i$. The update mechanism treats each slot as a query and the visual tokens as keys and values, computing attention weights to perform a weighted aggregation of the new visual features. This process uses a gated recurrent unit (GRU) to update the slot memory, resulting in a compact visual memory representation $S_{t-1}$ that captures persistent landmarks and spatial layouts. This approach reduces the memory and computational complexity from $O(t)$ to $O(1)$.
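To make the slot update concrete, below is a minimal PyTorch sketch of the slots-as-queries attention followed by a GRU-based memory update described above. The slot count, hidden size, initialization, and scaling are illustrative assumptions; only the query/key/value roles of slots versus tokens and the recurrent slot update are stated in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotMemory(nn.Module):
    """Fixed-capacity visual memory: slots query incoming tokens, a GRU updates them."""

    def __init__(self, num_slots: int = 32, dim: int = 768):
        super().__init__()
        self.init_slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)  # learnable initial slots
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)  # recurrent slot update

    def init_state(self, batch_size: int) -> torch.Tensor:
        return self.init_slots.unsqueeze(0).expand(batch_size, -1, -1)

    def update(self, slots: torch.Tensor, z_i: torch.Tensor) -> torch.Tensor:
        """slots: (B, K, D) current memory; z_i: (B, N, D) visual tokens Z_i of one past image."""
        q = self.to_q(slots)                                   # slots act as queries
        k, v = self.to_k(z_i), self.to_v(z_i)                  # new tokens are keys / values
        attn = F.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        agg = attn @ v                                         # attention-weighted aggregation
        B, K, D = slots.shape
        # The GRU takes the aggregated features as input and the old slots as hidden state.
        new_slots = self.gru(agg.reshape(B * K, D), slots.reshape(B * K, D))
        return new_slots.reshape(B, K, D)

# Rolling the memory over the history keeps only K slots alive, i.e. O(1) in t:
memory = SlotMemory(num_slots=32, dim=768)
slots = memory.init_state(batch_size=1)
for z_i in torch.randn(5, 1, 196, 768):    # stand-in for CLIP tokens of R_1..R_{t-1}
    slots = memory.update(slots, z_i)       # slots plays the role of S_i after each step
```

Because the same fixed set of slots is overwritten at every step, the cost of carrying history does not grow with trajectory length, which is the source of the $O(t)$-to-$O(1)$ reduction claimed above.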

The second module, Spatio-temporal Trajectory Encoding (STE), models the UAV's motion history. It takes the historical waypoint sequence $P_1, P_2, \ldots, P_{t-1}$ and transforms the absolute coordinates into relative motion representations. For each step, the displacement vector $\Delta P_i$ is computed and then decomposed into a unit direction vector $\mathbf{d}_i$ and a motion scale $r_i$. These are concatenated to form a 4D motion descriptor $M_i$. To encode temporal ordering, a time embedding $\tau_i$ is added, resulting in a time-aware motion representation $\widetilde{M}_i$. This representation is then projected into a $d$-dimensional trajectory token $t_i$ using a residual MLP encoder, producing a sequence of trajectory tokens $T_{t-1}$ that serves as an explicit motion prior.
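The following is an illustrative sketch of this trajectory encoding step under assumed details: the exact time-embedding form, MLP widths, and normalization are not specified in the paper, which only states the 4D descriptor (unit direction plus scale), the time embedding, and the residual MLP projection.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Turn absolute waypoints into time-aware relative-motion trajectory tokens."""

    def __init__(self, dim: int = 768, max_steps: int = 256):
        super().__init__()
        self.in_proj = nn.Linear(4, dim)              # lift the 4D motion descriptor M_i to d dims
        self.time_emb = nn.Embedding(max_steps, dim)  # learnable time embedding tau_i (assumed form)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, waypoints: torch.Tensor) -> torch.Tensor:
        """waypoints: (B, T, 3) absolute P_1..P_{t-1}  ->  (B, T-1, d) trajectory tokens."""
        delta = waypoints[:, 1:] - waypoints[:, :-1]          # displacement Delta P_i
        r = delta.norm(dim=-1, keepdim=True)                  # motion scale r_i
        d = delta / r.clamp(min=1e-6)                         # unit direction d_i
        m = torch.cat([d, r], dim=-1)                         # 4D motion descriptor M_i
        steps = torch.arange(m.shape[1], device=m.device)
        h = self.in_proj(m) + self.time_emb(steps)            # time-aware representation
        return h + self.mlp(h)                                # residual MLP -> trajectory tokens

tokens = TrajectoryEncoder(dim=768)(torch.randn(1, 12, 3))    # e.g. 12 past waypoints -> 11 tokens
```

Working in relative displacements rather than absolute coordinates makes the tokens invariant to where the trajectory starts, which is why the decomposition into direction and scale is useful as a motion prior.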

The third module, Prompt-Guided Multimodal Integration (PGM), integrates the historical visual memory, trajectory tokens, and the current instruction and observation into a structured prompt for the large language model. The natural language instruction $L$ is encoded using a BERT encoder and projected into a unified latent dimension. The compressed visual memory $S_{t-1}$ and trajectory tokens $T_{t-1}$ are also projected into the same space. These components, along with the current visual observation $R_t$, are organized into a structured prompt that includes the task instruction, a Qwen-compatible conversation template, and UAV history status information. This prompt is then fed into a large language model (Qwen2.5-3B) to predict the next 3D waypoint $P_{t+1}$ in continuous space. This design enables coherent long-horizon multimodal reasoning without requiring additional feature-level fusion mechanisms.
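As a rough illustration of what such a structured prompt might look like, the sketch below assembles Qwen-style chat messages with placeholder tags where the projected embeddings would be spliced in. The field names, wording, and placeholder tokens are assumptions; the paper only specifies that the prompt combines the task instruction, a Qwen-compatible conversation template, and UAV history status together with the projected memory, trajectory, and observation features.

```python
def build_navigation_prompt(instruction: str, step: int) -> list[dict]:
    """Assemble chat messages; placeholder tags mark where projected embeddings
    (slot memory, trajectory tokens, current observation) are inserted before the LLM."""
    system = (
        "You are a UAV navigation agent. Using the instruction, the compressed visual "
        "memory, the trajectory history, and the current observation, predict the next "
        "3D waypoint (x, y, z) in continuous space."
    )
    user = (
        f"Instruction: {instruction}\n"
        f"UAV history status: step {step}\n"
        "Historical visual memory: <HIST_MEM>\n"   # replaced by projected slots S_{t-1}
        "Trajectory history: <TRAJ>\n"             # replaced by projected tokens T_{t-1}
        "Current observation: <CUR_OBS>\n"         # replaced by tokens of R_t
        "Next waypoint:"
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

messages = build_navigation_prompt(
    "Fly past the collapsed bridge and land near the red truck.", step=14
)
```

Keeping the fusion at the prompt level, rather than adding a separate cross-modal fusion network, is what lets the LLM perform the long-horizon reasoning directly over the compressed context.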

Experiment

  • LongFly demonstrates superior performance on the OpenUAV benchmark, achieving 33.03m lower NE, 7.22% higher SR, and over 6.04% improvement in OSR and SPL compared to baselines on the seen dataset, with the largest gains on the Hard split.
  • On the unseen object set, LongFly achieves 43.87% SR and 64.56% OSR, outperforming NavFoM by 14.04% in SR and 16.57% in OSR, with significant gains in NE and SPL on the Hard subset.
  • On the unseen map set, LongFly attains 24.88% OSR and 7.98% SPL in the Hard split, the only method to maintain reasonable performance, while others fail (OSR ≈ 0), highlighting its robustness to novel layouts.
  • Ablation studies confirm that both SHIC and STE modules are essential, with their combination yielding the best results; prompt-guided fusion and longer history lengths significantly improve performance, especially in long-horizon tasks.
  • SHIC slot number analysis shows optimal performance at K=32, with improvements in SR, SPL, and NE as slots increase.
  • Qualitative results demonstrate LongFly’s ability to maintain global consistency and avoid local traps through spatiotemporal context integration, unlike the baseline that drifts due to myopic reasoning.

Results show that LongFly significantly outperforms all baseline methods across unseen environments, achieving the lowest NE and highest SR, OSR, and SPL. The model demonstrates robust generalization, particularly in unseen object and map settings, with the largest gains observed in challenging long-horizon scenarios.

Results show that LongFly significantly outperforms all baseline methods across all difficulty levels, achieving the lowest NE and highest SR, OSR, and SPL. On the Full split, LongFly reduces NE by 29.39 compared to the baseline BS and improves SR by 20.03 percentage points, demonstrating its effectiveness in long-horizon navigation.

Results show that the model achieves the best performance at a learning rate of 5 × 10⁻⁴, with the highest success rate (SR) of 24.19% and the highest SPL of 20.84%, while maintaining a low NE of 91.84. Performance remains stable across different learning rates, with only minor variations in SR, OSR, and SPL, indicating robustness to learning rate changes.

Results show that LongFly with prompt-guided fusion achieves significantly better performance than the version without prompts, reducing NE from 102.45 to 91.84 and increasing SR, OSR, and SPL. The model with all-frame history performs as well as the 60-frame version, indicating that longer history provides diminishing returns, while prompt guidance is essential for aligning spatiotemporal context with instructions.

The authors conduct an ablation study on the number of SHIC slots, showing that increasing the slot count from 8 to 32 improves performance across all metrics. With 32 slots, the model achieves the best results, reducing NE to 91.84, increasing SR to 24.19%, OSR to 43.86%, and SPL to 20.84%.

