Autonomous Overtaking Trajectory Optimization Using Reinforcement Learning and Opponent Pose Estimation
Matej Rene Cihlar Luka Šiktar Branimir Ćaran Marko Švaco
Abstract
Vehicle overtaking is one of the most complex driving maneuvers for autonomous vehicles. To optimize autonomous overtaking, driving systems rely on an array of sensors to ensure safe trajectory optimization and greater efficiency during the maneuver. This paper presents a reinforcement learning framework designed for multi-agent racing environments that optimizes the overtaking trajectory from LiDAR data and depth images. The developed reinforcement learning agent exploits pre-generated raceline data together with the sensor inputs to compute the optimal steering angle and linear velocity for an efficient overtake. The system uses a LiDAR sensor coupled with a 2D detection algorithm, as well as a depth camera equipped with a YOLO-based object detection module, to identify the vehicle being overtaken and estimate its pose. The detection data from the LiDAR and the depth camera are fused with an Unscented Kalman Filter (UKF) to improve the opponent pose estimate and optimize the overtaking trajectory in racing scenarios. The results show that the proposed algorithm successfully performs overtaking maneuvers both in simulation and in real-world experiments, with a root mean square error (RMSE) of the estimated pose of (0.0816, 0.0531) m along the (x, y) axes.
One-sentence Summary
Researchers from the University of Zagreb propose a multi-agent reinforcement learning framework using PPO to optimize autonomous overtaking in racing scenarios. By fusing LiDAR and YOLOv8 depth data via an Unscented Kalman Filter for robust opponent pose estimation, their system achieves successful real-world maneuvers on the F1TENTH platform, surpassing single-agent limitations.
Key Contributions
- The paper introduces a reinforcement learning agent using PPO that computes steering and velocity for optimal overtaking by integrating pre-generated raceline data with real-time sensor inputs in multi-agent racing environments.
- A sensor fusion system combining 2D LiDAR detection and YOLOv8-based depth camera data is implemented with an Unscented Kalman Filter to generate robust opponent pose estimates for trajectory planning.
- Experimental validation on a real-world F1TENTH platform demonstrates successful overtaking maneuvers in both simulation and physical tests, achieving a pose estimation RMSE of (0.0816, 0.0531) m in (x, y) coordinates.
Introduction
Autonomous overtaking in multi-agent racing environments demands precise trajectory optimization to balance speed and safety under dynamic, adversarial conditions. Prior approaches often rely on single-agent reinforcement learning that follows static racing lines or depend on simulation without real-world validation, while existing sensor fusion methods have not been fully adapted for the high-stakes context of autonomous racing. The authors leverage a reinforcement learning agent trained with PPO that incorporates real-time opponent pose estimation derived from fusing 2D LiDAR and depth camera data via an Unscented Kalman Filter. This system enables the vehicle to compute optimal steering and velocity commands for overtaking maneuvers, which the team successfully validated on a real-world F1TENTH platform with low pose estimation error.

Method
The proposed system integrates two primary components: pose estimation via sensor fusion and a reinforcement learning algorithm designed for overtaking maneuvers. The physical implementation relies on the F1TENTH platform equipped with a Jetson Xavier NX computer for processing. Sensor inputs are provided by a Hokuyo 2D LiDAR and an Intel RealSense D435 depth camera. The final hardware configuration is illustrated in the figure below.
Opponent pose estimation is performed relative to the RL agent using the on-board sensors. The LiDAR data undergoes object detection through clustering and rectangle fitting, where the most likely opponent cluster is selected based on the previous timestep estimate. Simultaneously, the depth camera utilizes both RGB and depth data. For RGB processing, a YOLOv8 object detection model identifies the opponent car's bounding box. Distance estimation leverages the bounding box height using a reciprocal function.
$$\text{distance} = \frac{a}{b \cdot \text{height} + b} + c$$
The parameters in this equation are determined experimentally. This method remains valid as the opponent is consistently viewed from the side, meaning height variation correlates primarily with distance rather than orientation. The yaw angle is approximated using the Ackermann steering model. Depth data is extracted from the bounding box center pixel location, with an average offset added to account for the measurement point on the opponent car. These measurements are fused using an Unscented Kalman Filter (UKF). The UKF state vector is defined as [x,y,vx,vy] utilizing a constant velocity model. Since the sensors operate at different publishing rates and formats, the filter updates on every measurement with static transformations applied for the sensor frames.
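The asynchronous fusion step can be sketched in NumPy. The paper uses a UKF, but because the constant-velocity model and position-only measurements are linear, the standard Kalman equations below yield the same estimate and keep the sketch short; all noise values, timestamps, and measurements here are illustrative assumptions, not the paper's tuning.

```python
import numpy as np

# State [x, y, vx, vy] under a constant-velocity model.
def predict(x, P, dt, q=0.1):
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)
    Q = q * np.eye(4)                      # process noise (assumed)
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, r):
    H = np.array([[1, 0, 0, 0],            # both sensors measure opponent (x, y)
                  [0, 1, 0, 0]], dtype=float)
    R = r * np.eye(2)                      # per-sensor measurement noise (assumed)
    y = z - H @ x                          # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + K @ y
    P = (np.eye(4) - K @ H) @ P
    return x, P

# Sensors publish at different rates: predict to each measurement's
# timestamp, then update, regardless of which sensor produced it.
x, P = np.zeros(4), np.eye(4)
measurements = [(0.10, np.array([0.31, 0.02]), 0.05),   # (t, z, R): LiDAR
                (0.13, np.array([0.36, 0.05]), 0.10),   # depth camera (noisier)
                (0.20, np.array([0.61, 0.03]), 0.05)]   # LiDAR again
t_prev = 0.0
for t, z, r in measurements:
    x, P = predict(x, P, t - t_prev)
    x, P = update(x, P, z, r)
    t_prev = t
```

The same predict/update loop runs whether the measurement came from the LiDAR cluster fit or the YOLO-plus-depth pipeline; only the measurement noise differs per sensor.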
For the decision-making process, the authors implement a multi-agent reinforcement learning (MARL) algorithm to learn overtaking maneuvers. The Proximal Policy Optimization (PPO) algorithm is selected to ensure training stability through a clipped loss function.
$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$
Here, r_t(θ) represents the probability ratio between the new and old policies, Â_t is the advantage estimate at time step t, and ε denotes the clipping-range hyperparameter. The neural network architecture consists of a fully connected feedforward network with two hidden layers of 256 neurons each, activated by ReLU functions. The input layer size is 162, corresponding to the observation size, while the output layer size is 2 with a tanh activation function. In the multi-agent setup, the opponent's actions are generated by the same neural network policy, and data from both agents contribute to the training process.
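The clipped surrogate can be illustrated numerically. This NumPy sketch implements the equation above with made-up log-probabilities and advantages (the paper's training framework and batch values are not specified here); the sign is flipped so the objective reads as a loss to minimize.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """Clipped PPO surrogate, negated to act as a loss."""
    ratio = np.exp(logp_new - logp_old)                  # r_t(theta)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # PPO maximizes the expected elementwise minimum of the two terms.
    return -np.mean(np.minimum(unclipped, clipped))

# Illustrative batch of three timesteps.
logp_old = np.array([-1.0, -0.5, -2.0])
logp_new = np.array([-0.8, -0.9, -1.5])
adv      = np.array([ 1.0, -0.5,  2.0])
loss = ppo_clip_loss(logp_new, logp_old, adv)
```

The clipping caps how far a single update can move the policy: large ratios with positive advantages, and small ratios with negative advantages, contribute at most their clipped value to the objective.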
The environment is simulated using the F1TENTH gym, which models the car as a simple bicycle model and handles collisions via a 2D occupancy grid map. The simulation replicates the real-world LiDAR sensor with 1080 points per scan over 270 degrees. The observation space includes agent LiDAR data, odometry for both cars, and future waypoints. To reduce complexity, every 10th LiDAR point is sampled, resulting in 108 points describing surrounding obstacles. Waypoints are derived from a precalculated racing line generated using the minimum curvature approach. The action space comprises continuous steering angle and linear velocity commands. The steering angle is scaled to ± 0.34 rad, and linear velocity ranges from 0 to 3 m/s. The reward signal is a weighted sum of seven components including velocity, progress, overtaking status, raceline adherence, collision avoidance, heading error, and smoothness.
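The observation downsampling and action scaling described above can be sketched directly; the helper names and reward weights below are illustrative assumptions, while the ranges (every 10th of 1080 LiDAR points, ±0.34 rad steering, 0–3 m/s velocity) come from the text.

```python
import numpy as np

MAX_STEER = 0.34   # rad, from the paper's action-space description
MAX_SPEED = 3.0    # m/s

def downsample_lidar(scan):
    # Keep every 10th point: 1080-point scan -> 108 obstacle readings.
    return np.asarray(scan)[::10]

def scale_action(net_out):
    # tanh output in [-1, 1] -> steering in [-0.34, 0.34] rad
    steer = float(net_out[0]) * MAX_STEER
    # tanh output in [-1, 1] -> velocity in [0, 3] m/s
    speed = (float(net_out[1]) + 1.0) / 2.0 * MAX_SPEED
    return steer, speed

def reward(terms, weights):
    # Weighted sum over the seven named components (weights are assumed).
    return sum(weights[k] * terms[k] for k in weights)

obs_lidar = downsample_lidar(np.ones(1080))
steer, speed = scale_action(np.array([0.5, 0.0]))
```

A network output of (0.5, 0.0) thus maps to a 0.17 rad steering command at half the maximum velocity.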
Experiment
- Opponent pose estimation using a UKF successfully tracks the opponent by fusing camera and LiDAR data, providing robust tracking that prevents the LiDAR from locking onto incorrect objects and enabling continuous updates at 10 Hz despite slower sensor rates.
- The UKF demonstrates superior performance in the y-direction compared to individual sensors, though it is less accurate than LiDAR alone in the x-direction, with estimation error increasing as distance grows.
- A real-world RL agent deployed on an F1TENTH car successfully executes overtaking maneuvers without collisions in conditions similar to its training environment, though trajectory oscillations are amplified by real-world noise and delays.
- The agent fails to generalize to significantly different track geometries, such as smaller looped tracks, highlighting a dependency on training conditions for successful deployment.
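A per-axis pose RMSE like the reported (0.0816, 0.0531) m is computed from estimated versus ground-truth opponent positions over time; the arrays below are illustrative placeholders, not the paper's data.

```python
import numpy as np

def per_axis_rmse(est, gt):
    """RMSE over timesteps, computed separately for each axis."""
    err = np.asarray(est) - np.asarray(gt)
    return np.sqrt(np.mean(err ** 2, axis=0))

# Rows are (x, y) positions at successive timesteps (placeholder values).
est = np.array([[1.00, 0.50], [2.10, 0.45], [2.95, 0.62]])
gt  = np.array([[1.05, 0.48], [2.00, 0.50], [3.00, 0.60]])
rmse_x, rmse_y = per_axis_rmse(est, gt)
```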