SA-DVAE: Improving Zero-Shot Skeleton-Based Action Recognition by Disentangled Variational Autoencoders

Existing zero-shot skeleton-based action recognition methods use projection networks to learn a shared latent space of skeleton features and semantic embeddings. The inherent imbalance in action recognition datasets, characterized by variable skeleton sequences yet constant class labels, presents significant challenges for alignment. To address this imbalance, we propose SA-DVAE (Semantic Alignment via Disentangled Variational Autoencoders), a method that first adopts feature disentanglement to separate skeleton features into two independent parts, one semantic-related and the other semantic-irrelevant, to better align skeleton and semantic features. We implement this idea via a pair of modality-specific variational autoencoders coupled with a total correlation penalty. We conduct experiments on three benchmark datasets, NTU RGB+D, NTU RGB+D 120, and PKU-MMD, and our experimental results show that SA-DVAE outperforms existing methods. The code is available at https://github.com/pha123661/SA-DVAE.
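
For readers unfamiliar with the penalty named above, it is presumably the total correlation term commonly used to encourage independence in disentangled VAEs. The block below is a minimal sketch of its standard definition for a latent code split into two parts. The symbols $z_r$ (semantic-related part), $z_v$ (semantic-irrelevant part), and $q$ (aggregate posterior of the skeleton encoder) are notation assumed here for illustration and are not taken from the abstract; how the quantity is estimated and weighted in training is not specified here.

```latex
% Total correlation between the two latent parts (assumed notation):
% it is the KL divergence between the joint aggregate posterior
% and the product of its marginals.
\begin{equation}
\mathrm{TC}(z_r, z_v)
  = D_{\mathrm{KL}}\!\left( q(z_r, z_v) \,\middle\|\, q(z_r)\, q(z_v) \right)
  = \mathbb{E}_{q(z_r, z_v)}\!\left[ \log \frac{q(z_r, z_v)}{q(z_r)\, q(z_v)} \right]
\end{equation}
```

This quantity is non-negative and equals zero exactly when $z_r$ and $z_v$ are statistically independent, which is why penalizing it pushes the two parts of the skeleton representation apart.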