
Detailed 2D-3D Joint Representation for Human-Object Interaction

Yong-Lu Li, Xinpeng Liu, Han Lu, Shiyi Wang, Junqi Liu, Jiefeng Li, Cewu Lu
Abstract

Human-Object Interaction (HOI) detection lies at the core of action understanding. Besides 2D information such as human/object appearance and locations, 3D pose is also commonly utilized in HOI learning because of its view-independence. However, rough 3D body joints carry only sparse body information and are not sufficient for understanding complex interactions; detailed 3D body shape is needed to go further. Meanwhile, the interacted object in 3D is also not fully studied in HOI learning. In light of these, we propose a detailed 2D-3D joint representation learning method. First, we utilize a single-view human body capture method to obtain detailed 3D body, face, and hand shapes. Next, we estimate the 3D object location and size with reference to the 2D human-object spatial configuration and object category priors. Finally, a joint learning framework and cross-modal consistency tasks are proposed to learn the joint HOI representation. To better evaluate models' capacity to resolve 2D ambiguity, we propose a new benchmark named Ambiguous-HOI consisting of hard ambiguous images. Extensive experiments on a large-scale HOI benchmark and Ambiguous-HOI show the impressive effectiveness of our method. Code and data are available at https://github.com/DirtyHarryLYL/DJ-RN.
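The step of estimating 3D object location from 2D cues can be illustrated with a pinhole-camera back-projection: given a 2D bounding box, a category size prior, and camera intrinsics, the object's depth and camera-space position follow from similar triangles. The following is a minimal sketch under those assumptions; the function name, parameters, and the use of the box height as the size cue are illustrative choices, not the paper's exact procedure.

```python
def estimate_object_3d(bbox_2d, real_height, focal_length, principal_point):
    """Rough 3D object localization from a 2D box and a category size prior.

    Sketch only: assumes a pinhole camera and that the object's real-world
    height (from a category prior) maps to the box's pixel height.
    """
    x1, y1, x2, y2 = bbox_2d
    pixel_height = y2 - y1
    # Similar triangles: depth = f * real_height / pixel_height
    depth = focal_length * real_height / pixel_height
    # Back-project the box center into camera coordinates
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    px, py = principal_point
    x_cam = (cx - px) * depth / focal_length
    y_cam = (cy - py) * depth / focal_length
    return (x_cam, y_cam, depth)
```

For example, an object with a 1 m size prior spanning 100 px under a 1000 px focal length is placed at a depth of 10 m; a box centered at the principal point lands on the optical axis.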
