
SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation

Edoardo Bianchi, Antonio Liotta
Publication date: 5/14/2025
Abstract

Assessing human skill levels in complex activities is a challenging problem with applications in sports, rehabilitation, and training. In this work, we present SkillFormer, a parameter-efficient architecture for unified multi-view proficiency estimation from egocentric and exocentric videos. Building on the TimeSformer backbone, SkillFormer introduces a CrossViewFusion module that fuses view-specific features using multi-head cross-attention, learnable gating, and adaptive self-calibration. We leverage Low-Rank Adaptation to fine-tune only a small subset of parameters, significantly reducing training costs. When evaluated on the EgoExo4D dataset, SkillFormer achieves state-of-the-art accuracy in multi-view settings while demonstrating remarkable computational efficiency, using 4.5x fewer parameters and requiring 3.75x fewer training epochs than prior baselines. It excels in multiple structured tasks, confirming the value of multi-view integration for fine-grained skill assessment.
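To make the fusion idea concrete, the following is a minimal, hypothetical sketch of a cross-view fusion block in the spirit of the abstract's description (multi-head cross-attention between view-specific features, a learnable gate, and a simple per-channel calibration step). All names, dimensions, and design details here are assumptions for illustration, not the authors' actual implementation; the calibration is approximated with a plain LayerNorm.

```python
# Hypothetical sketch of a CrossViewFusion-style block (illustration only;
# details are assumptions, not the SkillFormer authors' implementation).
import torch
import torch.nn as nn


class CrossViewFusion(nn.Module):
    """Fuse egocentric and exocentric features via cross-attention and a gate."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Cross-attention: ego features query the exo features.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable gate decides how much cross-view context to mix in.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        # Simple stand-in for adaptive self-calibration of the fused features.
        self.calibrate = nn.LayerNorm(dim)

    def forward(self, ego: torch.Tensor, exo: torch.Tensor) -> torch.Tensor:
        # ego, exo: (batch, tokens, dim) view-specific feature sequences
        ctx, _ = self.attn(query=ego, key=exo, value=exo)
        g = self.gate(torch.cat([ego, ctx], dim=-1))  # per-channel gate in (0, 1)
        return self.calibrate(g * ctx + (1 - g) * ego)


if __name__ == "__main__":
    fusion = CrossViewFusion(dim=256, num_heads=4)
    ego = torch.randn(2, 16, 256)  # 2 clips, 16 tokens each
    exo = torch.randn(2, 16, 256)
    out = fusion(ego, exo)
    print(out.shape)  # same shape as the ego input: (2, 16, 256)
```

The gated residual mixing keeps the ego stream dominant when the cross-view context is uninformative, which is one plausible reason such a design pairs well with parameter-efficient fine-tuning like LoRA.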