Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving

End-to-end autonomous driving has made impressive progress in recent years. Existing methods usually adopt the decoupled encoder-decoder paradigm, where the encoder extracts hidden features from raw sensor data and the decoder outputs the ego-vehicle's future trajectories or actions. Under such a paradigm, the encoder does not have access to the intended behavior of the ego agent, leaving the decoder with the burden of locating safety-critical regions within the massive receptive field and reasoning about future situations. Even worse, the decoder is usually composed of several simple multi-layer perceptrons (MLPs) or GRUs, while the encoder is delicately designed (e.g., a combination of heavy ResNets or Transformers). Such an imbalanced resource-task division hampers the learning process.

In this work, we aim to alleviate the aforementioned problem with two principles: (1) fully utilizing the capacity of the encoder; (2) increasing the capacity of the decoder. Concretely, we first predict a coarse-grained future position and action based on the encoder features. Then, conditioned on this position and action, the future scene is imagined to check the ramifications of driving accordingly. We also retrieve the encoder features around the predicted coordinates to obtain fine-grained information about the safety-critical regions. Finally, based on the predicted future and the retrieved salient features, we refine the coarse-grained position and action by predicting their offsets from the ground truth. This refinement module can be stacked in a cascaded fashion, which extends the capacity of the decoder with spatial-temporal prior knowledge about the conditioned future. We conduct experiments on the CARLA simulator and achieve state-of-the-art performance on closed-loop benchmarks. Extensive ablation studies demonstrate the effectiveness of each proposed module.
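To make the coarse-to-fine pipeline concrete, a minimal PyTorch sketch of the cascaded decoder is shown below. All module names, tensor shapes, and the choice of `grid_sample` for retrieving features around the predicted coordinate are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the cascaded refinement decoder described above.
# Shapes, layer sizes, and sampling strategy are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RefinementStage(nn.Module):
    """One refinement block: imagine the future scene conditioned on the
    current (position, action) guess, retrieve encoder features around the
    predicted coordinate, and predict an offset to correct the guess."""

    def __init__(self, feat_dim: int = 256, action_dim: int = 2):
        super().__init__()
        # "Imagines" a future-scene embedding from the conditioned prediction.
        self.imagine = nn.Sequential(
            nn.Linear(2 + action_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        # Predicts offsets for position (2) and action (action_dim).
        self.offset_head = nn.Sequential(
            nn.Linear(feat_dim * 2, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 2 + action_dim),
        )

    def forward(self, bev_feat, pos, action):
        # bev_feat: (B, C, H, W) feature map from the encoder.
        # pos:      (B, 2) predicted (x, y) in normalized [-1, 1] coordinates.
        # action:   (B, action_dim) coarse action (e.g., steer, throttle).
        future = self.imagine(torch.cat([pos, action], dim=-1))   # (B, C)
        # Retrieve the encoder feature at the predicted coordinate via
        # bilinear sampling (a larger local window could also be used).
        grid = pos.view(-1, 1, 1, 2)                              # (B, 1, 1, 2)
        local = F.grid_sample(bev_feat, grid, align_corners=False)
        local = local.view(bev_feat.size(0), -1)                  # (B, C)
        # Refine: predict offsets relative to the current guess.
        offset = self.offset_head(torch.cat([future, local], dim=-1))
        return pos + offset[:, :2], action + offset[:, 2:]


class CascadedDecoder(nn.Module):
    """Coarse prediction followed by K stacked refinement stages."""

    def __init__(self, feat_dim: int = 256, action_dim: int = 2, stages: int = 3):
        super().__init__()
        self.coarse_head = nn.Linear(feat_dim, 2 + action_dim)
        self.stages = nn.ModuleList(
            RefinementStage(feat_dim, action_dim) for _ in range(stages)
        )

    def forward(self, bev_feat):
        # Coarse prediction from globally pooled encoder features.
        pooled = bev_feat.mean(dim=(2, 3))                        # (B, C)
        coarse = self.coarse_head(pooled)
        pos, action = coarse[:, :2], coarse[:, 2:]
        outputs = [(pos, action)]
        for stage in self.stages:
            pos, action = stage(bev_feat, pos, action)
            outputs.append((pos, action))
        # Each stage's output can be supervised against the ground truth,
        # so every refinement step learns to shrink the remaining error.
        return outputs
```

One plausible design choice here, consistent with the abstract: supervising every stage's output (not just the last) lets each cascaded block learn to predict the residual between its input guess and the ground truth, which is what makes stacking extend the decoder's effective capacity.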