Fusing Visual Appearance and Geometry for Multi-modality 6DoF Object Tracking

In many applications of advanced robotic manipulation, six degrees of freedom (6DoF) object pose estimates are continuously required. In this work, we develop a multi-modality tracker that fuses information from visual appearance and geometry to estimate object poses. The algorithm extends our previous method ICG, which uses geometry, to additionally consider surface appearance. In general, object surfaces contain local characteristics from text, graphics, and patterns, as well as global differences from distinct materials and colors. To incorporate this visual information, two modalities are developed. For local characteristics, keypoint features are used to minimize distances between points from keyframes and the current image. For global differences, a novel region approach is developed that considers multiple regions on the object surface. In addition, it allows the modeling of external geometries. Experiments on the YCB-Video and OPT datasets demonstrate that our approach, ICG+, performs best on both benchmarks, outperforming both conventional and deep-learning-based methods. At the same time, the algorithm is highly efficient and runs at more than 300 Hz. The source code of our tracker is publicly available.
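To make the keypoint modality described above concrete, the following minimal sketch illustrates the kind of 2D point-to-point distances that such a modality minimizes: 3D points associated with keyframe keypoints are transformed by a candidate 6DoF pose, projected into the current image, and compared against the matched current-frame detections. All names, shapes, and the least-squares framing here are our own assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def keypoint_residuals(R, t, pts3d_keyframe, pts2d_current, K):
    """Reprojection residuals for a keypoint-based pose modality (illustrative).

    R:             (3, 3) rotation of the candidate object pose
    t:             (3,)   translation of the candidate object pose
    pts3d_keyframe: (N, 3) 3D points backprojected from keyframe keypoints
    pts2d_current:  (N, 2) matched keypoint locations in the current image
    K:             (3, 3) camera intrinsic matrix
    """
    pts_cam = (R @ pts3d_keyframe.T).T + t      # transform points into the camera frame
    proj = (K @ pts_cam.T).T                    # apply intrinsics
    proj = proj[:, :2] / proj[:, 2:3]           # perspective divide -> pixel coordinates
    return (proj - pts2d_current).ravel()       # stacked 2D distances to be minimized
```

A pose optimizer (e.g., Gauss-Newton or any nonlinear least-squares solver) would drive these residuals toward zero jointly with the geometric and region terms; how the actual tracker weights and combines the modalities is detailed in the paper itself, not in this sketch.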