Rendezvous: Attention Mechanisms for the Recognition of Surgical Action Triplets in Endoscopic Videos

Out of all existing frameworks for surgical workflow analysis in endoscopic videos, action triplet recognition stands out as the only one aiming to provide truly fine-grained and comprehensive information on surgical activities. This information, presented as ⟨instrument, verb, target⟩ combinations, is highly challenging to identify accurately. Triplet components can be difficult to recognize individually; this task requires not only recognizing all three triplet components simultaneously, but also correctly establishing the data association between them. To achieve this, we introduce a new model, the Rendezvous (RDV), which recognizes triplets directly from surgical videos by leveraging attention at two different levels. We first introduce a new form of spatial attention to capture individual action triplet components in a scene, called the Class Activation Guided Attention Mechanism (CAGAM). This technique focuses on the recognition of verbs and targets using activations resulting from instruments. To solve the association problem, our RDV model adds a new form of semantic attention inspired by Transformer networks, called the Multi-Head of Mixed Attention (MHMA). This technique uses several cross- and self-attentions to effectively capture relationships between instruments, verbs, and targets. We also introduce CholecT50 - a dataset of 50 endoscopic videos in which every frame has been annotated with labels from 100 triplet classes. Our proposed RDV model significantly improves the triplet prediction mean AP by over 9% compared to the state-of-the-art methods on this dataset.
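
To make the idea of mixing cross- and self-attention over triplet components more concrete, the following is a minimal illustrative sketch, not the authors' implementation: it assumes per-component token embeddings (instrument, verb, target) and uses standard multi-head attention, where self-attention refines each stream and cross-attention lets all components attend to instrument features. All module names, dimensions, and the projection layer are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MixedAttentionSketch(nn.Module):
    """Hypothetical sketch of combining self- and cross-attention heads."""

    def __init__(self, embed_dim=128, num_heads=4):
        super().__init__()
        # Self-attention refines the joint instrument/verb/target token sequence.
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Cross-attention: all triplet tokens query the instrument features,
        # modelling instrument-conditioned associations between components.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, inst_feats, verb_feats, target_feats):
        # Each input: (batch, num_tokens, embed_dim)
        tokens = torch.cat([inst_feats, verb_feats, target_feats], dim=1)
        self_out, _ = self.self_attn(tokens, tokens, tokens)
        # Queries from all components; keys/values from instrument features only.
        cross_out, _ = self.cross_attn(tokens, inst_feats, inst_feats)
        return self.proj(torch.cat([self_out, cross_out], dim=-1))

# Usage with dummy tensors: 6 instrument, 10 verb, 15 target tokens per frame.
x_i = torch.randn(2, 6, 128)
x_v = torch.randn(2, 10, 128)
x_t = torch.randn(2, 15, 128)
out = MixedAttentionSketch()(x_i, x_v, x_t)  # shape: (2, 31, 128)
```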