Hierarchical Vector Quantization for Unsupervised Action Segmentation

In this work, we address unsupervised temporal action segmentation, which segments a set of long, untrimmed videos into semantically meaningful segments that are consistent across videos. While recent approaches combine representation learning and clustering in a single step for this task, they do not cope with large variations within temporal segments of the same class. To address this limitation, we propose a novel method, termed Hierarchical Vector Quantization (HVQ), that consists of two subsequent vector quantization modules. This results in a hierarchical clustering where the additional subclusters cover the variations within a cluster. We demonstrate that our approach captures the distribution of segment lengths much better than the state of the art. To this end, we introduce a new metric based on the Jensen-Shannon Distance (JSD) for unsupervised temporal action segmentation. We evaluate our approach on three public datasets, namely Breakfast, YouTube Instructional and IKEA ASM. Our approach outperforms the state of the art in terms of F1 score, recall and JSD.
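
Since the proposed metric compares distributions of segment lengths via the Jensen-Shannon Distance, the following is a minimal sketch of how such a comparison could be computed; it is not the authors' implementation, and the per-frame label format, binning scheme (`num_bins`, `max_len`), and function names are assumptions made for illustration.

```python
# Hypothetical sketch: JSD between ground-truth and predicted segment-length
# distributions, assuming segmentations are given as per-frame label sequences.
import numpy as np
from scipy.spatial.distance import jensenshannon

def segment_lengths(frame_labels):
    """Lengths of maximal constant-label runs in a per-frame label sequence."""
    labels = np.asarray(frame_labels)
    # Indices where the label changes mark segment boundaries.
    change_points = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    boundaries = np.concatenate(([0], change_points, [len(labels)]))
    return np.diff(boundaries)

def length_jsd(gt_labels, pred_labels, num_bins=50, max_len=None):
    """Jensen-Shannon distance between the two segment-length histograms."""
    gt_len = segment_lengths(gt_labels)
    pr_len = segment_lengths(pred_labels)
    if max_len is None:
        max_len = max(gt_len.max(), pr_len.max())
    bins = np.linspace(0, max_len, num_bins + 1)
    p, _ = np.histogram(gt_len, bins=bins)
    q, _ = np.histogram(pr_len, bins=bins)
    # SciPy normalizes p and q and returns the square root of the JS divergence.
    return jensenshannon(p, q)

# Example usage with toy per-frame labels.
gt = [0] * 20 + [1] * 50 + [2] * 30
pred = [0] * 25 + [1] * 40 + [2] * 35
print(length_jsd(gt, pred))
```

A lower value indicates that the predicted segment lengths follow the ground-truth length distribution more closely, which is the property the abstract highlights.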