GSNet: Joint Vehicle Pose and Shape Reconstruction with Geometrical and Scene-aware Supervision

We present GSNet (Geometric and Scene-aware Network), a novel end-to-end framework that jointly estimates 6DoF poses and reconstructs detailed 3D car shapes from a single urban street view. GSNet uses a unique four-way feature extraction and fusion scheme and directly regresses 6DoF poses and shapes in a single forward pass. Extensive experiments show that this diverse feature extraction and fusion scheme greatly improves model performance. Based on a divide-and-conquer 3D shape representation strategy, GSNet reconstructs 3D vehicle shapes in great detail (1352 vertices and 2700 faces). This dense mesh representation further leads us to consider geometrical consistency and scene context, and inspires a new multi-objective loss function to regularize network training, which in turn improves the accuracy of 6D pose estimation and validates the merit of performing both tasks jointly. We evaluate GSNet on the largest multi-task ApolloCar3D benchmark and achieve state-of-the-art performance both quantitatively and qualitatively. The project page is available at https://lkeab.github.io/gsnet/.
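
To make the link between a dense mesh and pose supervision concrete, the following is a minimal sketch of one way a geometrical consistency term could be computed: mesh vertices are transformed by a 6DoF pose, projected with a pinhole camera, and compared against the projection under a reference pose. This is an illustration under stated assumptions, not the loss used in GSNet. The function names, the Euler-angle convention, and the camera intrinsics are all hypothetical.

```python
import numpy as np


def euler_to_rotation(yaw, pitch, roll):
    """Build a 3x3 rotation matrix from Euler angles in radians.

    The axis order (y, then x, then z) is an assumption made for this sketch.
    """
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    rot_y = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    rot_x = np.array([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])
    rot_z = np.array([[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]])
    return rot_y @ rot_x @ rot_z


def project_vertices(vertices, rotation, translation, intrinsics):
    """Apply a rigid transform to (N, 3) vertices and project them to pixels."""
    cam_points = vertices @ rotation.T + translation  # object frame -> camera frame
    homogeneous = cam_points @ intrinsics.T           # pinhole projection
    return homogeneous[:, :2] / homogeneous[:, 2:3]   # perspective divide


def reprojection_error(vertices, pred_pose, ref_pose, intrinsics):
    """Mean pixel distance between projections under two (rotation, translation) poses."""
    pred_px = project_vertices(vertices, pred_pose[0], pred_pose[1], intrinsics)
    ref_px = project_vertices(vertices, ref_pose[0], ref_pose[1], intrinsics)
    return float(np.mean(np.linalg.norm(pred_px - ref_px, axis=1)))


# Hypothetical intrinsics: focal length 1000 px, principal point (500, 300).
K = np.array([[1000.0, 0.0, 500.0], [0.0, 1000.0, 300.0], [0.0, 0.0, 1.0]])

# A stand-in mesh with 1352 vertices (random points, not a real car model).
rng = np.random.default_rng(0)
mesh_vertices = rng.uniform(-1.0, 1.0, size=(1352, 3))

ref_pose = (euler_to_rotation(0.3, 0.0, 0.0), np.array([0.0, 0.0, 20.0]))
pred_pose = (euler_to_rotation(0.35, 0.0, 0.0), np.array([0.1, 0.0, 20.5]))
error_px = reprojection_error(mesh_vertices, pred_pose, ref_pose, K)
```

Because every vertex contributes to the error, a denser mesh constrains rotation and translation more tightly than a handful of keypoints would, which is one plausible reading of why shape reconstruction and pose estimation benefit from joint training.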