
Abstract

One of the recent advances in surgical AI is the recognition of surgical activities as triplets of (instrument, verb, target). Although they provide detailed information for computer-assisted intervention, current triplet recognition approaches rely only on single-frame features. Exploiting the temporal cues from earlier frames would improve the recognition of surgical action triplets from videos. In this paper, we propose Rendezvous in Time (RiT), a deep learning model that extends the state-of-the-art model, Rendezvous, with temporal modeling. Focusing more on the verbs, RiT explores the connectedness of current and past frames to learn temporal attention-based features for enhanced triplet recognition. We validate our proposal on the challenging surgical triplet dataset CholecT45, demonstrating improved recognition of the verb and the triplet, along with other interactions involving the verb, such as (instrument, verb). Qualitative results show that RiT produces smoother predictions for most triplet instances than state-of-the-art methods. We present a novel attention-based approach that leverages the temporal fusion of video frames to model the evolution of surgical actions and exploit their benefits for surgical triplet recognition.
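To make the idea of attending from the current frame to past frames concrete, the following is a minimal sketch of generic scaled dot-product attention over a short window of per-frame feature vectors. It is not the RiT architecture, which the abstract does not specify in detail. The function name `temporal_attention`, the toy two-dimensional features, and the absence of learned projections are all assumptions made for illustration.

```python
import math


def temporal_attention(current, past):
    """Fuse the current frame's features with those of past frames.

    Illustrative only: plain scaled dot-product attention with no learned
    weights. `current` is a list of d floats (the query); `past` is a list
    of earlier frames' feature vectors, each of length d (keys and values).
    Returns the fused feature vector and the attention weights.
    """
    d = len(current)
    frames = past + [current]  # the current frame also attends to itself

    # Similarity between the current frame and every frame in the window.
    scores = [
        sum(q * k for q, k in zip(current, frame)) / math.sqrt(d)
        for frame in frames
    ]

    # Numerically stable softmax over the temporal axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of frame features, one output value per feature dimension.
    fused = [
        sum(w * frame[i] for w, frame in zip(weights, frames))
        for i in range(d)
    ]
    return fused, weights


# Hypothetical 2-D features for two past frames and the current frame.
window = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = temporal_attention([1.0, 0.0], window)
```

In this toy window, the past frame whose features match the current frame receives a higher weight than the dissimilar one. A trained model would replace the raw features with learned query, key, and value projections.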
