8 months ago

Multi-Task Learning

Computer Vision

Method/Architecture

C. Li J. Zhang

Abstract

Monocular 3D human pose estimation technologies have the potential to greatly increase the availability of human movement data. The best-performing models for single-image 2D-3D lifting use graph convolutional networks (GCNs) that typically require some manual input to define the relationships between different body joints. We propose a novel transformer-based approach that uses the more generalised self-attention mechanism to learn these relationships within a sequence of tokens representing joints. We find that the use of intermediate supervision, as well as residual connections between the stacked encoders, benefits performance. We also suggest that using error prediction as part of a multi-task learning framework improves performance by allowing the network to compensate for its confidence level. We perform extensive ablation studies to show that each of our contributions increases performance. Furthermore, we show that our approach outperforms the recent state of the art for single-frame 3D human pose estimation by a large margin. Our code and trained models are made publicly available on GitHub.
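The combination of intermediate supervision and error prediction described in the abstract can be illustrated with a small loss computation. The sketch below is not the authors' implementation: the function names, the L1 form of the error-prediction term, and the `error_weight` parameter are all assumptions made for illustration. It supposes that every stacked encoder emits a 3D pose and a per-joint error estimate, and that each of these outputs is supervised.

```python
from typing import List, Sequence

Pose = List[Sequence[float]]  # one (x, y, z) triple per joint


def per_joint_error(pred: Pose, target: Pose) -> List[float]:
    """Euclidean distance between predicted and true position, per joint."""
    return [
        sum((p - t) ** 2 for p, t in zip(pj, tj)) ** 0.5
        for pj, tj in zip(pred, target)
    ]


def multitask_loss(
    layer_poses: List[Pose],
    layer_errors: List[List[float]],
    target: Pose,
    error_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Sum a pose loss and an error-prediction loss over every encoder output.

    Supervising each layer (not only the last) is the intermediate
    supervision; regressing the network's own per-joint error is the
    second task of the multi-task setup. Both loss forms are assumptions.
    """
    total = 0.0
    for pose, err_pred in zip(layer_poses, layer_errors):
        true_err = per_joint_error(pose, target)
        n = len(true_err)
        pose_loss = sum(true_err) / n
        err_loss = sum(abs(e - t) for e, t in zip(err_pred, true_err)) / n
        total += pose_loss + error_weight * err_loss
    return total
```

In this sketch the error head is trained against the distance the pose head actually achieves, so a later refinement stage could in principle use the predicted error as a confidence signal.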


Jointformer: Single-Frame Lifting Transformer with Error Prediction and Refinement for 3D Human Pose Estimation