Learning Robot Visual Navigation in Crowds via Intention-Aware Scene Representations

Han Bao, Bingyi Xia, Hanjing Ye, Yu Zhan, Hao Cheng, Baozhi Jia, Wenjun Xu, Jiankun Wang

Abstract

Robot navigation in crowds requires the ability to infer human intentions while accounting for the structural constraints of the environment. Deep reinforcement learning (DRL) is currently a promising approach for learning navigation policies that understand human intentions. However, most DRL methods rely on limited scene representations, treating pedestrians as simple 2D points and ignoring the rich visual cues provided by both humans and the environment. To address this problem, we introduce iCrowdNav, a novel visual crowd navigation method based on intention-aware scene representations that encode behavioral and structural context from egocentric visual observations. Our method relies on two main components: a spatio-temporal encoder that extracts scene occupancy features, and an Intent-Interact Former (I²Former), an attention-based module that encodes human poses to infer pedestrians' motion intentions. These features are integrated into a compact state embedding that enables efficient DRL policy training. Extensive experiments demonstrate that our method outperforms baseline methods, and real-world deployment validates vision-based crowd navigation.

One-sentence Summary

iCrowdNav enhances deep reinforcement learning for crowd navigation with intention-aware scene representations that combine a spatio-temporal encoder for environmental occupancy with the Intent-Interact Former (I²Former), which infers pedestrian motion intentions from human poses observed in egocentric views; it outperforms baselines in extensive experiments and has been deployed successfully in the real world.

Key Contributions

  • iCrowdNav is a visual navigation framework that learns intention-aware scene representations directly from egocentric camera observations. The architecture combines a spatio-temporal encoder for extracting occupancy features with an Intent-Interact Former module to process human poses and infer pedestrian motion intentions.
  • The method replaces simplified 2D point representations with BEV features integrated with behavioral visual cues, enabling a deep reinforcement learning policy to navigate dense crowds. This design captures both environmental structural constraints and human behavioral context within a compact state embedding for effective policy training.
  • Extensive simulation benchmarks and real-world physical robot deployments demonstrate that the framework achieves improved safety and robustness compared to existing baselines. These results validate the practical viability of vision-based crowd navigation in dynamic, populated environments.

Introduction

Autonomous robot navigation in dense crowds is essential for real-world service applications but requires anticipating human behavior while navigating constrained spaces. Prior deep reinforcement learning methods typically oversimplify scene representations by treating pedestrians as low-dimensional 2D points and relying on basic occupancy maps. This approach ignores critical visual cues like body poses and environmental semantics, which limits generalization from controlled simulations to unstructured real-world environments. To address these limitations, the authors leverage a novel visual encoder within a deep reinforcement learning framework to learn intention-aware scene representations directly from egocentric RGB-D cameras. They combine a spatio-temporal encoder that extracts dense occupancy features with an attention-based module that infers pedestrian motion tendencies from 3D human poses. These enriched visual cues are fused into a compact state embedding that enables robots to navigate safely and efficiently in complex crowds with successful zero-shot sim-to-real deployment.

Dataset

  1. Dataset composition and sources: The authors generate the dataset entirely through simulation using the SocNav-Gym environment built on Isaac Sim. Visual data is captured using a Clearpath Dingo robot equipped with two Intel RealSense D435 RGB-D cameras, which deliver a combined field of view of approximately 140 degrees and a depth range of 0.3 to 10 meters.

  2. Key details for each subset: Training scenarios feature hallways, corners, cluttered spaces, and dense open areas. Testing scenarios span specialized indoor settings including hospitals, offices, and warehouses. The provided text does not specify exact dataset sizes, filtering thresholds, or subset mixture ratios.

  3. How the paper uses the data: The authors use the simulated trajectories to train and evaluate social navigation policies. Each episode randomizes the robot's starting and target positions to encourage adaptation across varied navigation tasks. Pedestrian agents follow the Social Force Model and move toward fully randomized destinations to drive policy learning.

  4. Processing and environment details: Natural pedestrian animations are rendered through Isaac Sim so that crowd interactions closely mirror real-world dynamics. The paper does not mention explicit cropping strategies or metadata construction; it relies instead on randomized episode initialization and physics-based pedestrian modeling to generate diverse navigation experiences.
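To make the pedestrian model in item 3 concrete, the following is a minimal sketch of one update step of a simplified Social Force Model: a driving force that relaxes the pedestrian's velocity toward a desired velocity pointing at the goal, plus an exponentially decaying repulsion from nearby agents. The function name `social_force_step` and every parameter value (`desired_speed`, `tau`, `A`, `B`, `dt`) are illustrative assumptions; the source does not give the settings used in SocNav-Gym.

```python
import math

def social_force_step(pos, vel, goal, others, dt=0.1,
                      desired_speed=1.3, tau=0.5, A=2.0, B=0.3):
    """Advance one pedestrian by one Euler step of a simplified social force model.

    All default parameter values are illustrative assumptions, not the paper's.
    """
    # Driving force: relax the current velocity toward the desired velocity,
    # which points at the goal with magnitude desired_speed.
    gx, gy = goal[0] - pos[0], goal[1] - pos[1]
    dist = math.hypot(gx, gy) or 1e-9
    fx = (desired_speed * gx / dist - vel[0]) / tau
    fy = (desired_speed * gy / dist - vel[1]) / tau

    # Repulsive force from every other agent, decaying exponentially with distance.
    for ox, oy in others:
        dx, dy = pos[0] - ox, pos[1] - oy
        d = math.hypot(dx, dy) or 1e-9
        mag = A * math.exp(-d / B)
        fx += mag * dx / d
        fy += mag * dy / d

    # Explicit Euler integration of velocity, then position.
    new_vel = (vel[0] + fx * dt, vel[1] + fy * dt)
    new_pos = (pos[0] + new_vel[0] * dt, pos[1] + new_vel[1] * dt)
    return new_pos, new_vel
```

Calling the function repeatedly with a freshly sampled `goal` per pedestrian would give the randomized-destination behavior described above; obstacle forces are omitted here for brevity.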

Method

The authors address the challenge of vision-based robot navigation in crowded environments by formulating the task as a partially observable Markov decision process. The overall system architecture is depicted in the framework diagram, which illustrates the flow from system inputs to navigation actions. The method consists of three primary components: a feature extraction module, a feature fusion module, and a deep reinforcement learning network.

The feature extraction module processes multi-timestep RGB-D images and pedestrian poses. As shown in the detailed module diagram, the spatio-temporal encoder handles the visual inputs. It uses a pre-trained RGB backbone to extract features from the images. These features are then lifted to 3D space and splatted into a bird's-eye-view (BEV) representation. To capture temporal dynamics, a temporal encoder aligns the BEV features from previous time steps ($t-1$ and $t-2$), resulting in a robust spatio-temporal BEV feature map. Because the encoder is pre-trained on external datasets, it is kept frozen during training.
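The lift-and-splat step can be illustrated with a purely geometric analogue. The sketch below back-projects a depth pixel into 3D with a pinhole camera model and accumulates the resulting points into a BEV count grid. It is not the authors' encoder, which lifts learned image features rather than raw points; the function names, the grid extents, the cell size, and the assumption that the camera's optical axis is aligned with the robot's forward direction are all hypothetical.

```python
def lift_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into a robot frame.

    Assumed frame: x forward, y left, z up, with the camera's optical axis
    aligned to the robot's forward direction (an illustrative simplification).
    """
    x_cam = (u - cx) * depth / fx   # offset to the camera's right
    y_cam = (v - cy) * depth / fy   # offset downward in the image
    return (depth, -x_cam, -y_cam)

def splat_to_bev(points, x_range=(0.0, 10.0), y_range=(-5.0, 5.0), cell=0.5):
    """Accumulate 3D points into a 2D BEV grid of per-cell point counts."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    grid = [[0] * ny for _ in range(nx)]
    for x, y, _z in points:
        # Points outside the grid extents are dropped.
        if x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]:
            i = int((x - x_range[0]) / cell)
            j = int((y - y_range[0]) / cell)
            grid[i][j] += 1
    return grid
```

In a learned encoder each cell would hold a pooled feature vector instead of a count, and grids from earlier time steps would be warped into the current robot frame before being fused.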

Simultaneously, the Intent-Interact Former (I²Former) extracts intention-aware features from the poses of surrounding pedestrians. The pose detection module identifies pedestrians and encodes their 17 joint tokens. These tokens are processed by an IntentFormer, which captures the intention of the pedestrians. The InteractFormer then integrates these intention features with the robot's internal state embedding ($e^t_{robot}$) through multi-head attention mechanisms, producing interaction-aware features.
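The attention operation at the core of such a module can be sketched as single-head scaled dot-product attention. In the sketch below the robot embedding acts as the query and per-pedestrian intention features act as keys and values; that role assignment, the absence of learned projection matrices, and the name `attend` are assumptions made for illustration, since the source only states that multi-head attention is used.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector.

    Returns the attended output vector and the attention weights.
    """
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Softmax with max subtraction for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors.
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights
```

A multi-head variant would run several such operations on different learned projections of the inputs and concatenate the outputs.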

The extracted spatio-temporal BEV features ($z^t_{bev}$) and the interaction features ($z^t_{interact}$) are concatenated with the robot state embedding to form a comprehensive state representation. This state embedding is fed into a DRL network based on the Proximal Policy Optimization (PPO) algorithm. The network is trained to maximize the expected cumulative reward, defined as:

$$
L(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{\infty} \gamma^{t} r^{t} \right]
$$
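For a single finite episode, the discounted sum inside the expectation can be computed with a backward recursion. The sketch below is a generic illustration of that quantity, not the authors' training code; the function name and the default discount value are assumptions.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma**t * rewards[t] for one finite episode.

    The default gamma is an illustrative assumption, not the paper's value.
    """
    g = 0.0
    # Work backward so each step needs one multiply and one add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

Averaging this quantity over many episodes sampled from the policy gives a Monte Carlo estimate of the objective.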

where $\gamma$ is the discount rate and $r^t$ is the reward at time $t$. The reward function is designed to encourage safe navigation, collision avoidance, and smooth trajectories. Specifically, the navigation reward $r_{\mathrm{nav}}^t$ provides dense feedback based on the distance to the goal ($d_g^t$) and the minimum distance to obstacles or pedestrians ($d_o^t$). The reward logic is structured as follows:

$$
r_{\mathrm{nav}}^{t} =
\begin{cases}
20, & \text{if } d_g^t \leq \rho_{\mathrm{robot}} \\
-20, & \text{else if } d_o^t \leq \rho_{\mathrm{robot}} \\
0.5\,(d_o^t - 0.9), & \text{else if } \rho_{\mathrm{robot}} < d_o^t < 0.9 \\
3.2\,(d_g^{t-1} - d_g^t), & \text{otherwise,}
\end{cases}
$$
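The piecewise reward translates directly into a chain of conditionals. The sketch below assumes that the distance in the final case is the goal distance, so that the term rewards progress toward the goal; the function and argument names are hypothetical.

```python
def nav_reward(d_goal, d_goal_prev, d_obs, robot_radius):
    """Navigation reward for one time step, following the piecewise definition.

    d_goal / d_goal_prev: distance to the goal at steps t and t-1.
    d_obs: minimum distance to any obstacle or pedestrian at step t.
    """
    if d_goal <= robot_radius:
        return 20.0                        # goal reached
    if d_obs <= robot_radius:
        return -20.0                       # collision
    if d_obs < 0.9:
        return 0.5 * (d_obs - 0.9)         # proximity penalty, always negative here
    return 3.2 * (d_goal_prev - d_goal)    # progress toward the goal
```

The branch order matters: reaching the goal takes precedence over a collision, and the progress term applies only when the robot is at least 0.9 away from every obstacle and pedestrian.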

where $\rho_{\mathrm{robot}}$ is the radius of the robot. To address potential jitter in the policy, a trajectory-smoothing reward $r_\omega^t$ is also included, which penalizes excessive angular velocity. The method is evaluated in complex indoor scenarios, as illustrated in the simulation visualization, where the robot must navigate through hallways and rooms while maintaining social distance from pedestrians.

Experiment

The proposed method was evaluated through simulated crowd navigation across varying spatial constraints and densities, long-horizon topological mapping tasks, and real-world deployments in complex public environments. These experiments validate the combined effectiveness of the spatio-temporal BEV encoder and intention-aware I²Former modules against ablated variants and established navigation baselines. Qualitatively, the full architecture consistently yields smoother trajectories, enhanced spatial awareness, and significantly reduced intrusion into personal space, particularly in dense or narrow settings. Ablation studies confirm that integrating both environmental occupancy features and pedestrian intention modeling is essential for flexible navigation, while real-world tests further demonstrate the system's robustness and social compliance under dynamic, occluded conditions.

The method is evaluated against baselines and ablated variants in office and hospital settings. In both environments the full method achieves the highest success rate and the lowest time spent in pedestrians' private zones among all compared methods, indicating the safest navigation behavior. The ablations show that both the intention-aware module and the BEV representation are crucial: removing either degrades success rate and personal-space compliance, and removing the intention-aware module in particular significantly increases the time spent in pedestrians' private zones.
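The two metrics named in this comparison could be computed from logged episodes roughly as follows. This is a sketch under stated assumptions: the source does not define how the private zone is measured, so the circular zone, its `zone_radius`, the time step `dt`, and the outcome labels are all hypothetical.

```python
import math

def success_rate(outcomes):
    """Fraction of episodes labeled 'success' (label names are assumed)."""
    return sum(1 for o in outcomes if o == "success") / len(outcomes)

def private_zone_time(robot_traj, ped_trajs, dt=0.1, zone_radius=0.5):
    """Total time in seconds the robot spends inside any pedestrian's zone.

    robot_traj: list of (x, y) positions, one per time step.
    ped_trajs: list of pedestrian trajectories with the same length.
    """
    steps = 0
    for t, (rx, ry) in enumerate(robot_traj):
        # Count the step once, even if several pedestrians are too close.
        if any(math.hypot(rx - p[t][0], ry - p[t][1]) < zone_radius
               for p in ped_trajs):
            steps += 1
    return steps * dt
```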

The authors also evaluate the method against several baselines and ablated variants across three distinct environments with varying widths and crowd densities. The full method consistently achieves the highest success rate and the most efficient navigation while minimizing intrusion into pedestrians' personal space. Removing either the intention-aware module or the BEV encoder causes significant drops in success rate and social compliance, with more collisions and social violations. In narrow and high-density scenarios, the method remains more stable and safer than the baselines, which often exhibit rigid or inefficient navigation behaviors.

The proposed navigation method was evaluated across office, hospital, and varied environmental settings against standard baselines and ablated variants to assess its overall effectiveness and social compliance. The full system consistently demonstrates superior navigation success and efficiency while strictly minimizing intrusion into pedestrians' personal space, particularly in narrow or crowded conditions where baseline approaches exhibit rigid or unsafe behaviors. Ablation studies further validate that both the intention-aware module and the BEV representation are critical for robust performance, as their removal significantly compromises safety, increases social violations, and degrades overall reliability.

