ClickDiff: Click to Induce Semantic Contact Map for Controllable Grasp Generation with Diffusion Models

Grasp generation aims to create complex hand-object interactions with a specified object. While traditional approaches for hand generation have primarily focused on visibility and diversity under scene constraints, they tend to overlook fine-grained hand-object interactions such as contacts, resulting in inaccurate and undesired grasps. To address these challenges, we propose a controllable grasp generation task and introduce ClickDiff, a controllable conditional generation model that leverages a fine-grained Semantic Contact Map (SCM). When synthesizing interactive grasps, the method enables precise control of grasp synthesis through either a user-specified or an algorithmically predicted Semantic Contact Map. Specifically, to make full use of contact supervision constraints and to accurately model the complex physical structure of hands, we propose a Dual Generation Framework. Within this framework, the Semantic Conditional Module generates reasonable contact maps based on fine-grained contact information, while the Contact Conditional Module utilizes contact maps alongside object point clouds to generate realistic grasps. We also examine the evaluation criteria applicable to controllable grasp generation. Both unimanual and bimanual generation experiments on the GRAB and ARCTIC datasets verify the validity of our proposed method, demonstrating the efficacy and robustness of ClickDiff, even with previously unseen objects. Our code is available at https://github.com/adventurer-w/ClickDiff.
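
To make the idea of a user-specified contact map concrete, the following is a minimal sketch of how clicked object points might be turned into a per-point contact signal with finger labels. It is an illustration under stated assumptions, not the ClickDiff implementation: the function name `clicks_to_contact_map`, the `radius` parameter, the Gaussian falloff, and the integer finger labels are all hypothetical, and a simple distance heuristic stands in for the learned Semantic Conditional Module.

```python
import numpy as np


def clicks_to_contact_map(points, click_indices, finger_ids, radius=0.02):
    """Hypothetical helper: spread user clicks into a per-point contact map.

    points        : (N, 3) object point cloud
    click_indices : indices of the K clicked points
    finger_ids    : one integer finger label per click (assumed encoding)
    radius        : assumed contact radius, in the units of the point cloud

    Returns a contact value in [0, 1] per point and a finger label per
    point (-1 where the point is farther than `radius` from every click).
    """
    points = np.asarray(points, dtype=float)
    clicks = points[np.asarray(click_indices)]  # (K, 3)
    finger_ids = np.asarray(finger_ids)

    # Pairwise distances between every object point and every click: (N, K).
    dists = np.linalg.norm(points[:, None, :] - clicks[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)  # index of the closest click per point
    min_dist = dists.min(axis=1)    # distance to that click

    # Gaussian falloff (a heuristic choice, not taken from the paper).
    contact = np.exp(-0.5 * (min_dist / radius) ** 2)
    finger = np.where(min_dist <= radius, finger_ids[nearest], -1)
    return contact, finger


# Toy usage: a random point cloud with two clicks assigned to fingers 0 and 1.
rng = np.random.default_rng(0)
object_points = rng.uniform(-0.05, 0.05, size=(2048, 3))
contact, finger = clicks_to_contact_map(object_points, [0, 100], [0, 1])
```

In a pipeline shaped like the Dual Generation Framework described above, a map of this kind would be passed together with the object point cloud to the grasp generator; here the map comes from a hand-written rule only so that the data layout is visible.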