SL-DML: Signal Level Deep Metric Learning for Multimodal One-Shot Action Recognition

Recognizing an activity with a single reference sample using metric learning approaches is a promising research field. The majority of few-shot methods focus on object recognition or face identification. We propose a metric learning approach that reduces the action recognition problem to a nearest neighbor search in embedding space. We encode signals into images and extract features using a deep residual CNN. Using triplet loss, we learn a feature embedding. The resulting encoder maps features into an embedding space in which smaller distances encode similar actions while larger distances encode different actions. Our approach is based on a signal-level formulation and remains flexible across a variety of modalities. It outperforms the baseline on the large-scale NTU RGB+D 120 dataset for the one-shot action recognition protocol by 5.6%. With just 60% of the training data, our approach still outperforms the baseline by 3.7%. With 40% of the training data, our approach performs comparably to the second-best follow-up approach. Further, we show that our approach generalizes well in experiments on the UTD-MHAD dataset for inertial, skeleton, and fused data, and on the Simitate dataset for motion capture data. Furthermore, our inter-joint and inter-sensor experiments suggest good capabilities on previously unseen setups.
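
The pipeline described in the abstract (signal-to-image encoding, a deep residual CNN embedding trained with triplet loss, and one-shot classification via nearest-neighbor search) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the authors' released code: the `signals_to_image` encoding, the choice of ResNet-18, the embedding dimension, and the helper names are placeholders for the general idea.

```python
# Hedged sketch: metric-learning pipeline for one-shot action recognition.
# Assumptions (not from the paper): ResNet-18 backbone, 128-d embedding,
# a simple min-max signal-to-image encoding, Euclidean nearest neighbor.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


def signals_to_image(signals: torch.Tensor, size: int = 224) -> torch.Tensor:
    """Encode a (channels, timesteps) signal matrix as a 3-channel image.

    The concrete encoding (normalization, channel layout) is an assumption;
    the paper's signal-level representation may differ in detail.
    """
    x = (signals - signals.min()) / (signals.max() - signals.min() + 1e-8)
    img = x.unsqueeze(0).repeat(3, 1, 1)  # replicate to 3 channels
    return F.interpolate(img.unsqueeze(0), size=(size, size),
                         mode="bilinear", align_corners=False).squeeze(0)


class Embedder(nn.Module):
    """Residual CNN producing an L2-normalized feature embedding."""

    def __init__(self, dim: int = 128):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, dim)
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.backbone(x), dim=-1)


def triplet_step(model, optimizer, anchor, positive, negative, margin=0.2):
    """One triplet-loss update: pull same-action pairs together, push different ones apart."""
    loss = F.triplet_margin_loss(model(anchor), model(positive),
                                 model(negative), margin=margin)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def one_shot_classify(model, query_img, reference_imgs, reference_labels):
    """Assign the label of the reference whose embedding is closest to the query."""
    q = model(query_img.unsqueeze(0))          # (1, dim)
    refs = model(reference_imgs)               # (N, dim)
    dists = torch.cdist(q, refs).squeeze(0)    # (N,)
    return reference_labels[dists.argmin().item()]
```

In the one-shot protocol this sketch mirrors, each novel action class is represented by a single reference sample, so classification reduces to finding the reference embedding nearest to the query embedding.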