2 months ago

Poseidon: A ViT-based Architecture for Multi-Frame Pose Estimation with Adaptive Frame Weighting and Multi-Scale Feature Fusion

Pace, Cesare Davide ; De Nunzio, Alessandro Marco ; De Stefano, Claudio ; Fontanella, Francesco ; Molinara, Mario
Abstract

Human pose estimation, a vital task in computer vision, involves detecting and localising human joints in images and videos. While single-frame pose estimation has seen significant progress, it often fails to capture the temporal dynamics for understanding complex, continuous movements. We propose Poseidon, a novel multi-frame pose estimation architecture that extends the ViTPose model by integrating temporal information for enhanced accuracy and robustness to address these limitations. Poseidon introduces key innovations: (1) an Adaptive Frame Weighting (AFW) mechanism that dynamically prioritises frames based on their relevance, ensuring that the model focuses on the most informative data; (2) a Multi-Scale Feature Fusion (MSFF) module that aggregates features from different backbone layers to capture both fine-grained details and high-level semantics; and (3) a Cross-Attention module for effective information exchange between central and contextual frames, enhancing the model's temporal coherence. The proposed architecture improves performance in complex video scenarios and offers scalability and computational efficiency suitable for real-world applications. Our approach achieves state-of-the-art performance on the PoseTrack21 and PoseTrack18 datasets, achieving mAP scores of 88.3 and 87.8, respectively, outperforming existing methods.
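
To make the idea of frame weighting concrete, the following is a minimal NumPy sketch of one generic way to weight frames by relevance: score each frame's feature vector, normalise the scores with a softmax over the time axis, and take the weighted sum. This is an illustrative assumption, not the AFW mechanism as implemented in Poseidon; the abstract does not specify how relevance is computed. The names `adaptive_frame_weighting` and `score_vector`, the linear scoring rule, and the feature shapes are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def adaptive_frame_weighting(frame_feats, score_vector):
    """Fuse per-frame features with softmax-normalised relevance weights.

    frame_feats:  array of shape (T, D), one D-dimensional feature per frame.
    score_vector: array of shape (D,), a hypothetical learned scoring
                  parameter (an assumption made for this sketch only).
    Returns the fused feature of shape (D,) and the weights of shape (T,).
    """
    scores = frame_feats @ score_vector   # (T,) one relevance score per frame
    weights = softmax(scores, axis=0)     # (T,) non-negative, sums to 1
    fused = weights @ frame_feats         # (D,) weighted sum over frames
    return fused, weights


# Toy example: 5 frames, 8-dimensional features, random values.
rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(5, 8))
score_vector = rng.normal(size=8)
fused, weights = adaptive_frame_weighting(frame_feats, score_vector)
```

With a zero `score_vector`, every frame receives the same weight and the fused feature reduces to the plain mean over frames, which is a useful sanity check for this kind of weighting.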
