Surgical Triplet Recognition via Diffusion Model

Surgical triplet recognition is an essential building block to enable next-generation context-aware operating rooms. The goal is to identify the combinations of instruments, verbs, and targets present in surgical video frames. In this paper, we propose DiffTriplet, a new generative framework for surgical triplet recognition employing the diffusion model, which predicts surgical triplets via iterative denoising. To handle the challenge of triplet association, two dedicated designs are proposed in our diffusion framework, i.e., association learning and association guidance. During training, we optimize the model in the joint space of triplets and individual components to capture the dependencies among them. At inference, we integrate association constraints into each update of the iterative denoising process, refining the triplet prediction with information from the individual components. Experiments on the CholecT45 and CholecT50 datasets show the superiority of the proposed method, which achieves new state-of-the-art performance for surgical triplet recognition. Our code will be released.
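To make the two designs more concrete, the following is a minimal, hypothetical PyTorch sketch of guided iterative denoising over the joint triplet space. It is not the paper's implementation: the module and function names (TripletDenoiser, association_guidance, sample), the toy vocabulary sizes, the noise schedule, and the specific outer-product guidance rule are all assumptions for illustration only.

```python
# Illustrative sketch (not the authors' implementation) of diffusion-style
# triplet denoising with an association-guidance step at every update.
# All names and hyperparameters here are hypothetical.

import torch
import torch.nn as nn

N_INST, N_VERB, N_TARG = 6, 10, 15      # toy component vocabulary sizes
N_TRIPLET = N_INST * N_VERB * N_TARG    # size of the joint triplet space


class TripletDenoiser(nn.Module):
    """Predicts clean triplet logits x0 from a noisy sample x_t."""

    def __init__(self, dim=N_TRIPLET, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, dim)
        )

    def forward(self, x_t, t):
        # Condition on the (normalized) timestep via a scalar channel.
        t_feat = t.float().view(-1, 1) / 1000.0
        return self.net(torch.cat([x_t, t_feat], dim=-1))


def association_guidance(x0_hat, inst, verb, targ, weight=0.5):
    """Nudge triplet logits toward agreement with component predictions.

    One plausible reading of "association guidance": the guidance target is
    the joint (outer-product) score of the three component probability
    vectors, flattened to the triplet space.
    """
    joint = torch.einsum("bi,bv,bt->bivt", inst, verb, targ)
    joint = joint.reshape(x0_hat.size(0), -1)
    return x0_hat + weight * (joint - torch.sigmoid(x0_hat))


@torch.no_grad()
def sample(model, inst, verb, targ, steps=50):
    """Iterative denoising from pure noise, guided at every update."""
    x_t = torch.randn(inst.size(0), N_TRIPLET)
    for t in reversed(range(steps)):
        t_batch = torch.full((x_t.size(0),), t)
        x0_hat = model(x_t, t_batch)                   # predict clean logits
        x0_hat = association_guidance(x0_hat, inst, verb, targ)
        alpha = t / steps                              # crude noise schedule
        x_t = alpha * x_t + (1 - alpha) * x0_hat       # step toward x0_hat
    return torch.sigmoid(x_t)                          # triplet probabilities


if __name__ == "__main__":
    model = TripletDenoiser()
    # Stand-in component probabilities (would come from frame encoders).
    inst = torch.softmax(torch.randn(2, N_INST), dim=-1)
    verb = torch.softmax(torch.randn(2, N_VERB), dim=-1)
    targ = torch.softmax(torch.randn(2, N_TARG), dim=-1)
    probs = sample(model, inst, verb, targ)
    print(probs.shape)  # torch.Size([2, 900])
```

The key structural point the sketch tries to capture is that the guidance correction is applied inside every denoising update, so component-level evidence constrains the triplet trajectory throughout sampling rather than only as a final post-hoc filter.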