PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes

Estimating the 6D pose of known objects is important for robots to interact with the real world. The problem is challenging due to the variety of objects as well as the complexity of a scene caused by clutter and occlusions between objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network for 6D object pose estimation. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regressing to a quaternion representation. We also introduce a novel loss function that enables PoseCNN to handle symmetric objects. In addition, we contribute a large scale video dataset for 6D object pose estimation named the YCB-Video dataset. Our dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames. We conduct extensive experiments on our YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is highly robust to occlusions, can handle symmetric objects, and provides accurate pose estimation using only color images as input. When using depth data to further refine the poses, our approach achieves state-of-the-art results on the challenging OccludedLINEMOD dataset. Our code and dataset are available at https://rse-lab.cs.washington.edu/projects/posecnn/.
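
To make two ideas in the abstract concrete, the sketch below (not the authors' implementation) illustrates (a) recovering a 3D translation from a localized 2D object center and a predicted camera distance under a standard pinhole model, and (b) a symmetry-tolerant pose loss based on nearest-point matching between transformed model points. The function names, the intrinsics fx, fy, px, py, and the inputs (cx, cy, Tz, rotation matrices, model points) are illustrative assumptions, and the loss is a sketch of the general symmetry-handling idea rather than the paper's exact formulation.

```python
import numpy as np


def translation_from_center(cx, cy, Tz, fx, fy, px, py):
    """Back-project a predicted 2D object center and depth to a 3D translation.

    Under a pinhole camera, a point (Tx, Ty, Tz) projects to
    (fx * Tx / Tz + px, fy * Ty / Tz + py), so the translation can be
    recovered from the localized center (cx, cy) and the estimated distance Tz.
    (Hypothetical helper; variable names are assumptions.)
    """
    Tx = (cx - px) * Tz / fx
    Ty = (cy - py) * Tz / fy
    return np.array([Tx, Ty, Tz])


def symmetric_pose_loss(R_pred, R_gt, model_points):
    """Average nearest-point distance between the object model transformed by
    the predicted and ground-truth rotations.

    Because each predicted point is matched to its closest ground-truth point,
    rotations that differ only by an object symmetry incur little or no loss.
    This is a sketch of the symmetry-handling idea, not the paper's loss.
    """
    pred = model_points @ R_pred.T   # (N, 3) model points under predicted rotation
    gt = model_points @ R_gt.T       # (N, 3) model points under ground-truth rotation
    # Pairwise distances between the two transformed point sets.
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    return d.min(axis=1).mean()
```

For example, with fx = fy = 500, px = py = 320, a center predicted at (400, 320), and Tz = 1.0 m, `translation_from_center` gives a translation of (0.16, 0.0, 1.0) m; the loss function would then compare the rotated model points at that pose against the ground truth while remaining insensitive to symmetric flips.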