Markerless Camera-to-Robot Pose Estimation via Self-supervised Sim-to-Real Transfer

Solving the camera-to-robot pose is a fundamental requirement for vision-based robot control, and is a process that takes considerable effort and care to make accurate. Traditional approaches require modification of the robot via markers, and subsequent deep learning approaches enabled markerless feature extraction. Mainstream deep learning methods use only synthetic data and rely on domain randomization to bridge the sim-to-real gap, because acquiring 3D annotations for real images is labor-intensive. In this work, we go beyond the limitations of 3D annotations for real-world data. We propose an end-to-end pose estimation framework that is capable of online camera-to-robot calibration, together with a self-supervised training method that scales training to unlabeled real-world data. Our framework combines deep learning and geometric vision to solve the robot pose, and the pipeline is fully differentiable. To train the Camera-to-Robot Pose Estimation Network (CtRNet), we leverage foreground segmentation and differentiable rendering for image-level self-supervision. The pose prediction is visualized through a renderer, and the image loss against the input image is back-propagated to train the neural network. Our experimental results on two public real datasets confirm the effectiveness of our approach over existing works. We also integrate our framework into a visual servoing system to demonstrate the promise of real-time, precise robot pose estimation for automation tasks.
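
To make the image-level self-supervision concrete, the following is a minimal PyTorch sketch of one training step: a network predicts the camera-to-robot pose from an image, a differentiable renderer produces a soft silhouette of the robot at that pose, and a pixel-wise loss against a foreground-segmentation pseudo-label is back-propagated through the renderer into the network. Everything here is a hypothetical stand-in rather than the actual CtRNet implementation: PoseNet, the intrinsics K, the Gaussian-splat render_soft_mask (a toy substitute for a full mesh silhouette renderer), and the zero target mask standing in for a real segmentation output; forward kinematics from joint angles is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

H = W = 64
# Hypothetical pinhole camera intrinsics for the sketch.
K = torch.tensor([[50.0, 0.0, W / 2],
                  [0.0, 50.0, H / 2],
                  [0.0, 0.0, 1.0]])

class PoseNet(nn.Module):
    """Hypothetical stand-in: image -> 6-DoF pose (axis-angle rotation + translation)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(16, 6)

    def forward(self, img):
        return self.head(self.backbone(img))

def skew(v):
    z = torch.zeros((), dtype=v.dtype)
    return torch.stack([
        torch.stack([z, -v[2], v[1]]),
        torch.stack([v[2], z, -v[0]]),
        torch.stack([-v[1], v[0], z]),
    ])

def axis_angle_to_matrix(a):
    """Rodrigues' formula: differentiable axis-angle -> rotation matrix."""
    theta = a.norm().clamp(min=1e-8)
    k = skew(a / theta)
    return torch.eye(3) + torch.sin(theta) * k + (1 - torch.cos(theta)) * (k @ k)

def render_soft_mask(points_cam, sigma=2.0):
    """Toy differentiable 'renderer': project 3D robot points with K and splat
    isotropic Gaussians into a soft silhouette. A real pipeline would render
    the robot mesh (e.g. a mesh silhouette renderer) instead."""
    uv = (K @ points_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)          # perspective divide
    gy, gx = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    grid = torch.stack([gx, gy], dim=-1)                  # (H, W, 2) pixel coords
    d2 = ((grid[None] - uv[:, None, None, :]) ** 2).sum(-1)  # (N, H, W)
    return torch.exp(-d2 / (2 * sigma ** 2)).amax(dim=0)     # soft union of splats

def self_supervised_step(net, optimizer, img, robot_points, target_mask):
    pose = net(img).squeeze(0)                            # predict pose from image
    R, t = axis_angle_to_matrix(pose[:3]), pose[3:]
    points_cam = robot_points @ R.T + t                   # robot frame -> camera frame
    rendered = render_soft_mask(points_cam)               # differentiable rendering
    loss = F.binary_cross_entropy(rendered.clamp(1e-6, 1 - 1e-6),
                                  target_mask)            # image-level loss
    optimizer.zero_grad()
    loss.backward()                                       # gradients flow through renderer
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    net = PoseNet()
    opt = torch.optim.Adam(net.parameters(), lr=1e-4)
    img = torch.rand(1, 3, H, W)          # stand-in camera image
    robot_points = torch.rand(20, 3)      # stand-in points on the robot surface
    target = torch.zeros(H, W)            # stand-in foreground-segmentation pseudo-label
    print(self_supervised_step(net, opt, img, robot_points, target))
```

Because both the projection and the splatting are differentiable, the image loss supervises the pose prediction without any 3D annotation on the real image; the foreground segmentation alone provides the training signal, which is the property the abstract describes.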