A Dual-Cycled Cross-View Transformer Network for Unified Road Layout Estimation and 3D Object Detection in the Bird's-Eye-View

The bird's-eye-view (BEV) representation allows robust learning of multiple tasks for autonomous driving, including road layout estimation and 3D object detection. However, contemporary methods for unified road layout estimation and 3D object detection rarely address the class imbalance of the training dataset or exploit multi-class learning to reduce the total number of networks required. To overcome these limitations, we propose a unified model for road layout estimation and 3D object detection inspired by the transformer architecture and the CycleGAN learning framework. The proposed model addresses the performance degradation caused by the class imbalance of the dataset by utilizing the focal loss and the proposed dual cycle loss. Moreover, we set up extensive learning scenarios to study the effect of multi-class learning for road layout estimation in various situations. To verify the effectiveness of the proposed model and the learning scheme, we conduct a thorough ablation study and a comparative study. The experimental results attest to the effectiveness of our model; we achieve state-of-the-art performance in both the road layout estimation and 3D object detection tasks.
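For concreteness, the focal loss referenced above is the standard formulation of Lin et al. (2017), which down-weights well-classified examples to counter class imbalance. Below is a minimal PyTorch sketch for binary BEV occupancy maps; the function name, tensor shapes, and default hyperparameters are illustrative assumptions rather than the paper's exact implementation, and the proposed dual cycle loss is specific to this work and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss (Lin et al., 2017), sketched for per-pixel BEV maps.

    logits:  raw network outputs, e.g. shape (N, H, W)  [assumed shape]
    targets: binary ground-truth layout/occupancy, same shape, float tensor
    """
    # Unreduced binary cross-entropy so the focal modulation
    # can be applied per pixel before averaging.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # p_t: the model's predicted probability of the true class per pixel.
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma shrinks the loss where the prediction is already
    # confident, so rare-class pixels dominate the gradient signal.
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```

In a unified BEV setting, such a loss would typically be applied per semantic class (road, vehicle, etc.), which is where the class-imbalance benefit matters most; the exact per-class weighting used in the paper is not specified in the abstract.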