HyperAIHyperAI

Command Palette

Search for a command to run...

RAD-2: Generator-Discriminator フレームワークにおける Reinforcement Learning のスケーリング

Hao Gao Shaoyu Chen Yifan Zhu Yuehao Song Wenyu Liu Qian Zhang Xinggang Wang

概要

ご指定いただいた基準に基づき、提供された英文を技術的に正確かつ自然な日本語に翻訳いたしました。翻訳文高度な自動運転には、マルチモーダルな将来の不確実性をモデル化しつつ、クローズドループ(closed-loop)の相互作用においてもロバスト性を維持できるモーションプランナー(motion planner)が求められます。Diffusionベースのプランナーは、複雑な軌道分布のモデル化には効果的である一方、純粋な模倣学習(imitation learning)のみで学習させた場合、確率的な不安定性や、修正のためのネガティブフィードバックの欠如といった課題に直面することが多々あります。これらの問題を解決するために、本研究ではクローズドループ・プランニングのための統一的なgenerator-discriminatorフレームワークである「RAD-2」を提案します。具体的には、Diffusionベースのgeneratorを用いて多様な軌道候補を生成し、RLによって最適化されたdiscriminatorが、長期的な運転の質に基づいてこれらの候補をリランキング(rerank)します。このデカップリング(decoupled)された設計により、高次元の軌道空間全体に対してスパースなスカラー報酬を直接適用することを回避し、最適化の安定性を向上させています。さらに、reinforcement learningを強化するために、時間的な一貫性を利用してクレジット割り当て(credit assignment)問題を軽減する「Temporally Consistent Group Relative Policy Optimization」を導入します。加えて、クローズドループのフィードバックを構造化された縦方向の最適化信号へと変換し、generatorを報酬の高い軌道多様体(manifold)へと段階的に移行させる「On-policy Generator Optimization」を提案します。効率的な大規模学習を支援するため、空間的なワーピング(warping)を通じてBird's-Eye View(BEV)の特徴空間上で直接クローズドループ評価を行う、高スループットなシミュレーション環境「BEV-Warp」を導入しました。実験の結果、RAD-2は強力なDiffusionベースのプランナーと比較して、衝突率を56%削減しました。実世界へのデプロイメントにおいても、複雑な都市交通環境における知覚的安全性の向上と運転の滑らかさが実証されました。

One-sentence Summary

To improve stability and correct imitation learning errors in multimodal autonomous driving scenarios, the researchers propose RAD-2, a unified generator-discriminator framework that utilizes a diffusion-based generator to produce trajectory candidates and an RL-optimized discriminator to rerank them, while employing Temporally Consistent Group Relative Policy Optimization to enhance reinforcement learning.

Key Contributions

  • The paper introduces RAD-2, a unified generator-discriminator framework that utilizes a diffusion-based generator for diverse trajectory production and an RL-optimized discriminator for reranking candidates based on long-term driving quality.
  • This work presents Temporally Consistent Group Relative Policy Optimization to improve credit assignment through temporal coherence and On-policy Generator Optimization to refine trajectory distributions using structured longitudinal signals.
  • The authors develop BEV-Warp, a high-throughput feature-level simulation pipeline that warps BEV features around the ego vehicle to enable scalable closed-loop training without expensive image-level rendering.

Introduction

High-level autonomous driving requires motion planners that can model multimodal uncertainty while remaining robust during closed-loop interactions. While diffusion-based planners excel at capturing complex trajectory distributions, they often face stochastic instabilities and lack corrective feedback when trained solely through imitation learning. Furthermore, applying reinforcement learning directly to high-dimensional trajectory spaces is difficult due to sparse rewards and severe credit assignment challenges. The authors leverage a unified generator-discriminator framework to decouple these tasks, using a diffusion-based generator to produce diverse candidates and an RL-optimized discriminator to rerank them based on long-term driving quality. To support this, they introduce Temporally Consistent Group Relative Policy Optimization to stabilize the RL search space and BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation in the Bird's-Eye View feature space to bypass expensive image-level rendering.

Dataset

Dataset overview
Dataset overview

Dataset Overview

The authors utilize a multi-stage dataset strategy to train and evaluate their motion generation and reinforcement learning frameworks:

  • Generator Pretraining Data: The authors use approximately 50,000 hours of real-world driving data containing ego-vehicle trajectories. This large-scale dataset is used to pre-train the motion generator to capture the multimodal distribution of human driving behaviors.
  • BEV Warp Environment Subsets:
    • Source and Filtering: The authors initially collect 50,000 clips from real-world logs, with each clip lasting 10 to 20 seconds. These clips undergo closed-loop simulation in the BEV Warp environment to identify specific driving behaviors. Clips are filtered to isolate safety-critical scenarios (high collision risk) and efficiency-related scenarios (suboptimal performance).
    • Training Sets: Two curated training sets are created for closed-loop reinforcement learning, each containing 10,000 clips focused on either safety or efficiency objectives.
    • Evaluation Sets: Two disjoint subsets of 512 clips each are constructed for closed-loop assessment, corresponding to the safety and efficiency categories.
  • 3DGS Environment Subsets:
    • Source: The authors utilize the photorealistic 3D Gaussian Splatting (3DGS) simulation benchmark from Senna-2, which focuses on high-risk safety scenarios.
    • Usage: 1,044 clips are used to train the trajectory discriminator, while 256 clips are reserved for closed-loop evaluation.
  • Open-loop Evaluation Scenarios: The authors adopt the Senna-2 open-loop evaluation dataset to test planning quality across six representative scenarios: car-following start, car-following stop, lane changing, intersections, curves, and heavy braking.

Method

The proposed framework, RAD-2, employs a generator-discriminator architecture to achieve robust and safe motion planning in autonomous driving. This design decouples the high-dimensional trajectory generation process from the low-dimensional reinforcement learning (RL) optimization, enabling stable and efficient policy training. As illustrated in the framework diagram, the system operates through two primary components: a diffusion-based generator and an RL-trained discriminator. The generator produces a diverse set of candidate trajectories conditioned on the current observation, while the discriminator evaluates and reranks these candidates based on their expected long-term outcomes. This joint policy is defined as \\Pi_{\\theta,\\phi}(\\tau|o) = \\mathbb{E}_{c \\sim \\mathcal{G}_\\theta(\\cdot|o)}[\\mathcal{D}_\\phi(\\tau|o, \\mathcal{C})], which allows for a structured policy where the generator explores a broad space of feasible actions and the discriminator selectively prioritizes higher-quality behaviors. The architecture inherently supports inference-time scaling by increasing the number of candidate trajectories without requiring retraining.

Framework of the proposed method
Framework of the proposed method

The diffusion-based generator, as shown in the diagram, models a multimodal distribution over future trajectories. It first encodes the current observation oto_tot into Bird's Eye View (BEV) features TbT_bTb and extracts scene-specific information from static map elements, dynamic agents, and navigation inputs. These components are processed by lightweight encoders to obtain token embeddings, which are then fused with the BEV features to form a unified scene embedding EtextsceneE_{\\text{scene}}Etextscene. This embedding conditions a DiT-based trajectory generator via cross-attention. For MMM independent modes, the generator iteratively denoises an initial noise trajectory over KKK steps to produce a set of candidate trajectories widehatmathcalT\\widehat{\\mathcal{T}}widehatmathcalT. These trajectories are passed to the discriminator for evaluation.

Illustration of the framework for multimodal trajectory planning paradigms
Illustration of the framework for multimodal trajectory planning paradigms

The discriminator evaluates candidate trajectories by first encoding each point in the trajectory via a shared MLP, and then processing the resulting sequence with a Transformer encoder to produce a trajectory-level query QtauQ_{\\tau}Qtau. This query aggregates information from the entire trajectory and is used to interact with the scene context. The scene representation is constructed from the same inputs as the generator, using independent encoders for static and dynamic elements. The trajectory-query interacts with the scene context through cross-attention mechanisms to produce fused embeddings EtextfusionE_{\\text{fusion}}Etextfusion. A final sigmoid activation applied to this fused representation produces a scalar score for each candidate trajectory, which is used for reranking. This process enables the discriminator to provide precise, long-term outcome-based feedback, which is then used to guide the optimization of the generator.

Schematic of the BEV-Warp simulation environment
Schematic of the BEV-Warp simulation environment

To scale the RL training process, the system leverages a high-throughput, feature-level simulation environment called BEV-Warp. This environment enables efficient closed-loop interaction by manipulating BEV features directly, bypassing the need for expensive image-level rendering. The simulation is initialized from real-world sequences, and at each timestep, the system extracts a reference BEV feature and the current agent pose. The planner generates candidate trajectories, from which an optimal one is selected. To maintain temporal coherence and ensure stable exploration, the system employs a trajectory reuse mechanism. Once a trajectory is selected, its corresponding control commands are executed over a fixed horizon, stabilizing the agent's motion. This mechanism ensures that the cumulative reward accurately reflects the quality of the selected trajectory, facilitating effective policy gradients. The closed-loop evaluation is driven by a recursive feature-warping mechanism, where a warp matrix derived from the relative pose deviation between the simulated agent and the logged reference is applied to the reference BEV feature to synthesize the next high-fidelity observation.

Spatial equivariance in BEV-Warp
Spatial equivariance in BEV-Warp

The joint policy optimization is realized through a multi-stage iterative process. The global objective is to minimize the KL-divergence between the hybrid policy and an ideal high-efficiency distribution. The training pipeline consists of three stages: (i) Temporally Consistent Rollout, which collects stable closed-loop interaction data; (ii) Discriminator Optimization, where the discriminator is optimized via a Temporally Consistent Group Relative Policy Optimization (TC-GRPO) framework to enhance its scoring precision; and (iii) Generator Optimization, which employs On-policy Generator Optimization (OGO) to shift the generator's distribution toward safer and more efficient behaviors. The TC-GRPO framework introduces a structured rollout and reward assignment mechanism to address the credit assignment problem in continuous driving, ensuring that the sparse environment reward is directly attributed to the specific trajectory hypothesis sustained within each persistent interval. The OGO mechanism converts closed-loop reward signals into structured longitudinal optimizations, adjusting the acceleration profile of raw trajectory segments to better align with safety and efficiency goals. This allows the generator to iteratively shift its output distribution toward favorable long-term outcomes without compromising stability.

Training pipeline of RAD-2
Training pipeline of RAD-2

The training process begins with a pre-training stage where the diffusion-based generator is initialized via imitation learning on expert demonstrations to capture multi-modal trajectory priors. This is followed by a closed-loop rollout phase where the joint policy interacts with the BEV-Warp environment to generate diverse rollout data. The discriminator is then optimized via the TC-GRPO framework, leveraging the closed-loop feedback to enhance its ability to rank trajectories. Finally, the generator is optimized through OGO, which uses the structured longitudinal optimization signals derived from low-reward rollouts to refine its distribution. The system employs a cyclic optimization loop, with the discriminator updated more frequently than the generator, ensuring continuous co-adaptation. The entire framework enables a self-improving closed loop, where the generator and discriminator jointly optimize the overall policy, progressively shifting the trajectory distribution toward safer and more efficient behaviors.

Experiment

The proposed method is evaluated through closed-loop simulations in BEV Warp and 3DGS environments to assess interactive driving behavior, alongside open-loop benchmarks to validate trajectory accuracy. Results demonstrate that the synergistic joint optimization of the generator and discriminator significantly improves the balance between safety and efficiency compared to decoupled or single-objective training strategies. Qualitative analysis and ablation studies further confirm that the framework achieves superior collision avoidance and smoother navigation through effective reward-based filtering and robust inference-time scaling.

The the the table presents an ablation study on the execution horizon, showing how different values affect collision rate, safety, and efficiency metrics. The results indicate that an intermediate horizon achieves the best balance between performance and stability. An execution horizon of 8 achieves the highest efficiency and the best trade-off between safety and collision rate. Lower horizons lead to higher collision rates and reduced safety, while higher horizons decrease efficiency. The optimal horizon balances stable credit assignment with reactive flexibility for effective training.

Ablation on execution horizon
Ablation on execution horizon

The authors compare several methods on open-loop trajectory accuracy using a benchmark that includes various driving scenarios. Results show that the proposed method achieves the lowest collision rate and the best trajectory quality metrics, outperforming prior approaches across all evaluated measures. The proposed method achieves the lowest collision rate and trajectory error metrics compared to all baseline methods. The method significantly reduces both dynamic and static collision components in open-loop scenarios. It demonstrates superior trajectory quality with the lowest ADE and FDE values among the evaluated approaches.

Open-loop trajectory accuracy comparison
Open-loop trajectory accuracy comparison

The results show that enabling clip filtering improves efficiency while maintaining safety in trajectory planning. This indicates that filtering out low-variance scenarios leads to more stable and effective training outcomes. Clip filtering improves efficiency without compromising safety Enabling clip filtering enhances [email protected] performance Filtering low-variance clips stabilizes training dynamics

Effect of clip filtering
Effect of clip filtering

The authors compare two reinforcement learning objectives, one without and one with an entropy term, to evaluate their impact on safety and efficiency. Results show that incorporating the entropy term improves safety and efficiency metrics compared to the version without it. Including the entropy term in the RL objective improves safety and efficiency The version with the entropy term achieves a lower collision rate The version with the entropy term achieves higher safety and efficiency scores

RL objective ablation study
RL objective ablation study

The the the table shows the impact of different group sizes on model performance, with a group size of 4 achieving the best balance between safety and efficiency metrics. Larger group sizes lead to a decline in safety performance while improving efficiency slightly. A group size of 4 achieves the highest safety and efficiency scores. Increasing the group size beyond 4 reduces safety metrics. Efficiency improves with larger group sizes, but at the cost of safety.

Ablation on group size
Ablation on group size

The experiments evaluate various architectural and training components through ablation studies and comparative benchmarks to optimize driving performance. Results demonstrate that an intermediate execution horizon, the inclusion of an entropy term in the RL objective, and the use of clip filtering all contribute to more stable and efficient training. Furthermore, the proposed method outperforms baseline approaches in open-loop trajectory accuracy, while an optimal group size is necessary to balance safety with operational efficiency.


AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助
すぐに使える GPU
最適な料金体系

HyperAI Newsletters

最新情報を購読する
北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします
メール配信サービスは MailChimp によって提供されています