RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

Hao Gao Shaoyu Chen Yifan Zhu Yuehao Song Wenyu Liu Qian Zhang Xinggang Wang

Abstract

High-level autonomous driving requires motion planners that can model multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and a lack of corrective negative feedback when trained purely with imitation learning.

To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator produces diverse trajectory candidates, while a discriminator optimized with reinforcement learning (RL) reranks these candidates according to their long-term driving quality. This decoupled design avoids applying sparse scalar rewards directly to the full high-dimensional trajectory space, which improves optimization stability.

To further strengthen reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds.

To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View (BEV) feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.

One-sentence Summary

To improve stability and correct imitation learning errors in multimodal autonomous driving scenarios, the researchers propose RAD-2, a unified generator-discriminator framework that utilizes a diffusion-based generator to produce trajectory candidates and an RL-optimized discriminator to rerank them, while employing Temporally Consistent Group Relative Policy Optimization to enhance reinforcement learning.

Key Contributions

  • The paper introduces RAD-2, a unified generator-discriminator framework that utilizes a diffusion-based generator for diverse trajectory production and an RL-optimized discriminator for reranking candidates based on long-term driving quality.
  • This work presents Temporally Consistent Group Relative Policy Optimization to improve credit assignment through temporal coherence and On-policy Generator Optimization to refine trajectory distributions using structured longitudinal signals.
  • The authors develop BEV-Warp, a high-throughput feature-level simulation pipeline that warps BEV features around the ego vehicle to enable scalable closed-loop training without expensive image-level rendering.

Introduction

High-level autonomous driving requires motion planners that can model multimodal uncertainty while remaining robust during closed-loop interactions. While diffusion-based planners excel at capturing complex trajectory distributions, they often face stochastic instabilities and lack corrective feedback when trained solely through imitation learning. Furthermore, applying reinforcement learning directly to high-dimensional trajectory spaces is difficult due to sparse rewards and severe credit assignment challenges. The authors leverage a unified generator-discriminator framework to decouple these tasks, using a diffusion-based generator to produce diverse candidates and an RL-optimized discriminator to rerank them based on long-term driving quality. To support this, they introduce Temporally Consistent Group Relative Policy Optimization to stabilize the RL search space and BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation in the Bird's-Eye View feature space to bypass expensive image-level rendering.

Dataset

Dataset overview

Dataset Overview

The authors utilize a multi-stage dataset strategy to train and evaluate their motion generation and reinforcement learning frameworks:

  • Generator Pretraining Data: The authors use approximately 50,000 hours of real-world driving data containing ego-vehicle trajectories. This large-scale dataset is used to pre-train the motion generator to capture the multimodal distribution of human driving behaviors.
  • BEV-Warp Environment Subsets:
    • Source and Filtering: The authors initially collect 50,000 clips from real-world logs, with each clip lasting 10 to 20 seconds. These clips undergo closed-loop simulation in the BEV-Warp environment to identify specific driving behaviors. Clips are filtered to isolate safety-critical scenarios (high collision risk) and efficiency-related scenarios (suboptimal performance).
    • Training Sets: Two curated training sets are created for closed-loop reinforcement learning, each containing 10,000 clips focused on either safety or efficiency objectives.
    • Evaluation Sets: Two disjoint subsets of 512 clips each are constructed for closed-loop assessment, corresponding to the safety and efficiency categories.
  • 3DGS Environment Subsets:
    • Source: The authors utilize the photorealistic 3D Gaussian Splatting (3DGS) simulation benchmark from Senna-2, which focuses on high-risk safety scenarios.
    • Usage: 1,044 clips are used to train the trajectory discriminator, while 256 clips are reserved for closed-loop evaluation.
  • Open-loop Evaluation Scenarios: The authors adopt the Senna-2 open-loop evaluation dataset to test planning quality across six representative scenarios: car-following start, car-following stop, lane changing, intersections, curves, and heavy braking.

Method

The proposed framework, RAD-2, employs a generator-discriminator architecture to achieve robust and safe motion planning in autonomous driving. This design decouples the high-dimensional trajectory generation process from the low-dimensional reinforcement learning (RL) optimization, enabling stable and efficient policy training. As illustrated in the framework diagram, the system operates through two primary components: a diffusion-based generator and an RL-trained discriminator. The generator produces a diverse set of candidate trajectories conditioned on the current observation, while the discriminator evaluates and reranks these candidates based on their expected long-term outcomes. The joint policy is defined as $\Pi_{\theta,\phi}(\tau \mid o) = \mathbb{E}_{\mathcal{C} \sim \mathcal{G}_\theta(\cdot \mid o)}\left[\mathcal{D}_\phi(\tau \mid o, \mathcal{C})\right]$, where $\mathcal{C}$ is the candidate set sampled from the generator $\mathcal{G}_\theta$ and $\mathcal{D}_\phi$ is the discriminator. This yields a structured policy in which the generator explores a broad space of feasible actions and the discriminator selectively prioritizes higher-quality behaviors. The architecture inherently supports inference-time scaling: the number of candidate trajectories can be increased without retraining.
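The generate-then-rerank structure of the joint policy can be illustrated with a minimal Python sketch. Everything here is a toy stand-in and an assumption for illustration: `generate_candidates` replaces the diffusion generator with random speed profiles, `score` replaces the learned discriminator with a hand-written heuristic, and the observation keys `speed` and `target_speed` are hypothetical. It is not the authors' implementation.

```python
import math
import random


def generate_candidates(observation, num_candidates, rng):
    """Toy stand-in for the generator (assumption): random 8-step speed profiles."""
    base_speed = observation["speed"]
    return [
        [max(0.0, base_speed + rng.gauss(0.0, 1.0)) for _ in range(8)]
        for _ in range(num_candidates)
    ]


def score(observation, trajectory):
    """Toy stand-in for the discriminator (assumption): sigmoid score in (0, 1).

    Prefers profiles whose speeds stay close to a target speed.
    """
    target = observation["target_speed"]
    error = sum(abs(v - target) for v in trajectory) / len(trajectory)
    return 1.0 / (1.0 + math.exp(error - 1.0))


def plan(observation, num_candidates, seed=0):
    """Sample candidates, score each one, and return the top-ranked trajectory."""
    rng = random.Random(seed)
    candidates = generate_candidates(observation, num_candidates, rng)
    scores = [score(observation, c) for c in candidates]
    best = max(range(num_candidates), key=scores.__getitem__)
    return candidates[best], scores[best]


if __name__ == "__main__":
    obs = {"speed": 5.0, "target_speed": 6.0}
    for n in (4, 16, 64):
        _, best_score = plan(obs, n)
        print(n, round(best_score, 3))
```

Because the same seed yields the same first candidates, the best score in this toy setup cannot decrease as `num_candidates` grows, which mirrors the idea of scaling at inference time by sampling more candidates without retraining.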

Framework of the proposed method

The diffusion-based generator, as shown in the diagram, models a multimodal distribution over future trajectories. It first encodes the current observation $o_t$ into Bird's-Eye View (BEV) features $T_b$ and extracts scene-specific information from static map elements, dynamic agents, and navigation inputs. These components are processed by lightweight encoders to obtain token embeddings, which are then fused with the BEV features to form a unified scene embedding $E_{\text{scene}}$. This embedding conditions a DiT-based trajectory generator via cross-attention. For $M$ independent modes, the generator iteratively denoises an initial noise trajectory over $K$ steps to produce a set of candidate trajectories $\widehat{\mathcal{T}}$. These trajectories are passed to the discriminator for evaluation.

Illustration of the framework for multimodal trajectory planning paradigms

The discriminator evaluates candidate trajectories by first encoding each point in the trajectory via a shared MLP, and then processing the resulting sequence with a Transformer encoder to produce a trajectory-level query $Q_{\tau}$. This query aggregates information from the entire trajectory and is used to interact with the scene context. The scene representation is constructed from the same inputs as the generator, using independent encoders for static and dynamic elements. The trajectory query interacts with the scene context through cross-attention mechanisms to produce fused embeddings $E_{\text{fusion}}$. A final sigmoid activation applied to this fused representation produces a scalar score for each candidate trajectory, which is used for reranking. This process enables the discriminator to provide precise, long-term outcome-based feedback, which is then used to guide the optimization of the generator.

Schematic of the BEV-Warp simulation environment

To scale the RL training process, the system leverages a high-throughput, feature-level simulation environment called BEV-Warp. This environment enables efficient closed-loop interaction by manipulating BEV features directly, bypassing the need for expensive image-level rendering. The simulation is initialized from real-world sequences, and at each timestep, the system extracts a reference BEV feature and the current agent pose. The planner generates candidate trajectories, from which an optimal one is selected. To maintain temporal coherence and ensure stable exploration, the system employs a trajectory reuse mechanism. Once a trajectory is selected, its corresponding control commands are executed over a fixed horizon, stabilizing the agent's motion. This mechanism ensures that the cumulative reward accurately reflects the quality of the selected trajectory, facilitating effective policy gradients. The closed-loop evaluation is driven by a recursive feature-warping mechanism, where a warp matrix derived from the relative pose deviation between the simulated agent and the logged reference is applied to the reference BEV feature to synthesize the next high-fidelity observation.
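The feature-warping step can be sketched in plain Python under simplifying assumptions: poses are planar `(x, y, yaw)` tuples, the BEV feature is a square single-channel grid centred on the logged ego pose, and resampling uses nearest-neighbour lookup. The function names, the grid layout (rows forward, columns left), and the `cell_size` parameter are hypothetical; a real pipeline would warp multi-channel feature tensors with interpolation.

```python
import math


def relative_pose(ref_pose, sim_pose):
    """Pose of the simulated agent expressed in the logged reference frame.

    Both poses are (x, y, yaw) tuples in a shared world frame.
    """
    dx, dy = sim_pose[0] - ref_pose[0], sim_pose[1] - ref_pose[1]
    c, s = math.cos(ref_pose[2]), math.sin(ref_pose[2])
    return (c * dx + s * dy, -s * dx + c * dy, sim_pose[2] - ref_pose[2])


def warp_bev(ref_bev, delta, cell_size=1.0, fill=0.0):
    """Resample the reference BEV grid as seen from the deviated pose.

    ref_bev is a square list of lists centred on the reference ego pose.
    delta is the (tx, ty, dyaw) deviation returned by relative_pose.
    """
    n = len(ref_bev)
    centre = (n - 1) / 2.0
    tx, ty, dyaw = delta
    c, s = math.cos(dyaw), math.sin(dyaw)
    warped = [[fill] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Cell centre in the simulated agent's frame (metres).
            xs, ys = (i - centre) * cell_size, (j - centre) * cell_size
            # The same point expressed in the reference frame.
            xr = c * xs - s * ys + tx
            yr = s * xs + c * ys + ty
            ri = round(xr / cell_size + centre)
            rj = round(yr / cell_size + centre)
            # Cells that fall outside the logged grid keep the fill value.
            if 0 <= ri < n and 0 <= rj < n:
                warped[i][j] = ref_bev[ri][rj]
    return warped
```

With a zero deviation the grid is returned unchanged; if the simulated agent is one cell ahead of the logged pose, every feature appears one cell closer to the ego vehicle. Cells with no counterpart in the logged grid are simply filled with a constant here, which is one reason large deviations from the log would degrade the synthesized observation in such a scheme.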

Spatial equivariance in BEV-Warp

The joint policy optimization is realized through a multi-stage iterative process. The global objective is to minimize the KL-divergence between the hybrid policy and an ideal high-efficiency distribution. The training pipeline consists of three stages: (i) Temporally Consistent Rollout, which collects stable closed-loop interaction data; (ii) Discriminator Optimization, where the discriminator is optimized via a Temporally Consistent Group Relative Policy Optimization (TC-GRPO) framework to enhance its scoring precision; and (iii) Generator Optimization, which employs On-policy Generator Optimization (OGO) to shift the generator's distribution toward safer and more efficient behaviors. The TC-GRPO framework introduces a structured rollout and reward assignment mechanism to address the credit assignment problem in continuous driving, ensuring that the sparse environment reward is directly attributed to the specific trajectory hypothesis sustained within each persistent interval. The OGO mechanism converts closed-loop reward signals into structured longitudinal optimizations, adjusting the acceleration profile of raw trajectory segments to better align with safety and efficiency goals. This allows the generator to iteratively shift its output distribution toward favorable long-term outcomes without compromising stability.
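The summary does not give the exact TC-GRPO objective, so the sketch below only shows the group-relative normalization used in standard GRPO, applied to the cumulative rewards of a group of rollouts that each held one trajectory hypothesis over the same interval. The function name and the `eps` stabilizer are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(interval_rewards, eps=1e-6):
    """Normalize each rollout's interval reward against its group.

    interval_rewards holds one cumulative reward per rollout in the group,
    each earned while a single trajectory hypothesis was being executed.
    """
    mu = mean(interval_rewards)
    sigma = pstdev(interval_rewards)
    return [(r - mu) / (sigma + eps) for r in interval_rewards]


if __name__ == "__main__":
    # Four rollouts from the same start: two collide (reward 0), two do not.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because each reward is tied to the one hypothesis sustained during the interval, the sign of the advantage indicates directly whether that candidate should be scored higher or lower than its group peers. A group in which every rollout earns the same reward yields all-zero advantages and therefore no learning signal.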

Training pipeline of RAD-2

The training process begins with a pre-training stage where the diffusion-based generator is initialized via imitation learning on expert demonstrations to capture multi-modal trajectory priors. This is followed by a closed-loop rollout phase where the joint policy interacts with the BEV-Warp environment to generate diverse rollout data. The discriminator is then optimized via the TC-GRPO framework, leveraging the closed-loop feedback to enhance its ability to rank trajectories. Finally, the generator is optimized through OGO, which uses the structured longitudinal optimization signals derived from low-reward rollouts to refine its distribution. The system employs a cyclic optimization loop, with the discriminator updated more frequently than the generator, ensuring continuous co-adaptation. The entire framework enables a self-improving closed loop, where the generator and discriminator jointly optimize the overall policy, progressively shifting the trajectory distribution toward safer and more efficient behaviors.
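The ordering of the stages can be made concrete with a small scheduling sketch. The stage names, the function, and the update ratio are hypothetical; the text only states that the discriminator is updated more often than the generator, not how often, and pairing one rollout with each discriminator update is an assumption.

```python
def training_schedule(num_cycles, disc_updates_per_cycle):
    """Return the ordered stage names for a cyclic training run.

    Assumes (hypothetically) one rollout per discriminator update and one
    generator update at the end of each cycle.
    """
    stages = ["pretrain_generator"]  # imitation learning, done once
    for _ in range(num_cycles):
        for _ in range(disc_updates_per_cycle):
            stages.append("rollout")
            stages.append("update_discriminator")
        stages.append("update_generator")
    return stages


if __name__ == "__main__":
    for stage in training_schedule(num_cycles=2, disc_updates_per_cycle=3):
        print(stage)
```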

Experiment

The proposed method is evaluated through closed-loop simulations in the BEV-Warp and 3DGS environments to assess interactive driving behavior, alongside open-loop benchmarks to validate trajectory accuracy. Results show that jointly optimizing the generator and discriminator improves the balance between safety and efficiency compared to decoupled or single-objective training strategies. Qualitative analysis and ablation studies further indicate that the framework achieves better collision avoidance and smoother navigation through reward-based filtering and inference-time scaling.

The table presents an ablation study on the execution horizon, showing how different values affect collision rate, safety, and efficiency metrics. An execution horizon of 8 achieves the highest efficiency and the best trade-off between safety and collision rate. Shorter horizons lead to higher collision rates and reduced safety, while longer horizons decrease efficiency. An intermediate horizon therefore balances stable credit assignment with reactive flexibility.
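The role of the execution horizon can be sketched as a rollout loop that replans only every `horizon` steps and records the reward earned while each selected trajectory is held. The callables `plan_fn` and `step_fn` and the integer toy state in the usage example are hypothetical placeholders, not part of the described system.

```python
def rollout_with_reuse(plan_fn, step_fn, initial_state, num_steps, horizon):
    """Replan every `horizon` steps and execute the held trajectory in between.

    plan_fn(state) returns a list of at least `horizon` actions.
    step_fn(state, action) returns (next_state, reward).
    Returns the final state and one (start_step, summed_reward) pair per
    held trajectory.
    """
    state = initial_state
    segments = []
    trajectory = None
    for t in range(num_steps):
        if t % horizon == 0:
            trajectory = plan_fn(state)   # select a new trajectory
            segments.append([t, 0.0])
        action = trajectory[t % horizon]  # reuse the held trajectory
        state, reward = step_fn(state, action)
        segments[-1][1] += reward         # credit goes to the held trajectory
    return state, [tuple(seg) for seg in segments]


if __name__ == "__main__":
    # Toy dynamics: the state is a position, and reaching 6 incurs a penalty.
    final, segs = rollout_with_reuse(
        plan_fn=lambda s: [1, 1, 1, 1],
        step_fn=lambda s, a: (s + a, -1.0 if s + a == 6 else 0.0),
        initial_state=0,
        num_steps=8,
        horizon=4,
    )
    print(final, segs)
```

A short horizon spreads an outcome across many replanning decisions, which makes credit assignment harder; a long horizon keeps credit clean but delays reactions to scene changes, consistent with the trade-off reported in the ablation.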

Ablation on execution horizon

The authors compare several methods on open-loop trajectory accuracy using a benchmark that covers various driving scenarios. The proposed method achieves the lowest collision rate, reducing both the dynamic and static collision components, and the lowest ADE and FDE among the evaluated approaches.
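For reference, ADE (average displacement error) and FDE (final displacement error) are conventionally computed as the mean and the final-step Euclidean distance between a predicted and a ground-truth trajectory. The sketch below follows those standard definitions; the benchmark's exact evaluation horizon and averaging over scenarios are not specified here.

```python
import math


def ade(pred, gt):
    """Average displacement error: mean pointwise Euclidean distance."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def fde(pred, gt):
    """Final displacement error: Euclidean distance at the last waypoint."""
    return math.dist(pred[-1], gt[-1])


if __name__ == "__main__":
    predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    ground_truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
    print(ade(predicted, ground_truth), fde(predicted, ground_truth))
```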

Open-loop trajectory accuracy comparison

The results show that enabling clip filtering improves efficiency while maintaining safety in trajectory planning, and that it enhances [email protected] performance. Filtering out low-variance clips also stabilizes training dynamics, leading to more effective training outcomes.
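One plausible way to filter low-variance clips, shown here purely as an assumption since the exact criterion is not described, is to drop clips whose rollout rewards barely differ. Such clips would give near-zero group-relative advantages and so contribute little training signal. The function name, the reward layout, and the threshold are hypothetical.

```python
from statistics import pvariance


def filter_clips(clip_rewards, min_variance=1e-6):
    """Keep clip ids whose rollout rewards vary enough to be informative.

    clip_rewards maps a clip id to the rewards of its group of rollouts.
    """
    return [
        clip_id
        for clip_id, rewards in clip_rewards.items()
        if pvariance(rewards) >= min_variance
    ]


if __name__ == "__main__":
    rewards = {
        "always_safe": [1.0, 1.0, 1.0, 1.0],
        "mixed": [0.0, 1.0, 0.0, 1.0],
        "always_fails": [0.0, 0.0, 0.0, 0.0],
    }
    print(filter_clips(rewards))
```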

Effect of clip filtering

The authors compare two reinforcement learning objectives, one with and one without an entropy term, to evaluate their impact on safety and efficiency. The version with the entropy term achieves a lower collision rate and higher safety and efficiency scores.

RL objective ablation study

The table shows the impact of group size on model performance. A group size of 4 achieves the best balance between safety and efficiency. Increasing the group size beyond 4 slightly improves efficiency but reduces the safety metrics.

Ablation on group size

The experiments evaluate various architectural and training components through ablation studies and comparative benchmarks to optimize driving performance. Results demonstrate that an intermediate execution horizon, the inclusion of an entropy term in the RL objective, and the use of clip filtering all contribute to more stable and efficient training. Furthermore, the proposed method outperforms baseline approaches in open-loop trajectory accuracy, while an optimal group size is necessary to balance safety with operational efficiency.

