8 months ago

Video Understanding

Visual Question Answering

Computer Vision

Young Jin Ahn Jungwoo Park Sangha Park Jonghyun Choi Kee-Eung Kim

Abstract

Visual Speech Recognition (VSR) stands at the intersection of computer vision and speech recognition, aiming to interpret spoken content from visual cues. A prominent challenge in VSR is the presence of homophenes, visually similar lip gestures that represent different phonemes. Prior approaches have sought to distinguish fine-grained visemes by aligning visual and auditory semantics, but often fell short of full synchronization. To address this, we present SyncVSR, an end-to-end learning framework that leverages quantized audio for frame-level crossmodal supervision. By integrating a projection layer that synchronizes visual representation with acoustic data, our encoder learns to generate discrete audio tokens from a video sequence in a non-autoregressive manner. SyncVSR shows versatility across tasks, languages, and modalities at the cost of a forward pass. Our empirical evaluations show that it not only achieves state-of-the-art results but also reduces data usage by up to ninefold.
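To make the idea of frame-level crossmodal supervision more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each video frame is aligned with a fixed number of quantized audio tokens, that a single linear projection maps every frame's visual feature to one logit vector per audio token slot, and that the training signal is a plain cross-entropy averaged over all slots. The names `project`, `frame_level_audio_loss`, `tokens_per_frame`, and `vocab_size` are hypothetical and do not come from the abstract. Every frame is scored independently of the tokens predicted for other frames, which is what makes the prediction non-autoregressive.

```python
import math
import random


def project(features, weight, bias):
    """Linear projection applied to each frame independently.

    features: T frames, each a list of D floats (visual encoder outputs).
    weight:   K*V rows, each a list of D floats (assumed layout).
    bias:     K*V floats.
    Returns T logit vectors of length K*V.
    """
    return [
        [sum(f * w for f, w in zip(frame, row)) + b for row, b in zip(weight, bias)]
        for frame in features
    ]


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def frame_level_audio_loss(features, weight, bias, audio_tokens,
                           tokens_per_frame, vocab_size):
    """Mean cross-entropy between projected frame features and audio token ids.

    audio_tokens: T lists of K integer ids in [0, vocab_size), i.e. the
    quantized audio aligned to each video frame (an assumed alignment).
    """
    logits = project(features, weight, bias)
    total, count = 0.0, 0
    for frame_logits, frame_targets in zip(logits, audio_tokens):
        for k in range(tokens_per_frame):
            # Slice out the logits belonging to the k-th audio slot of this frame.
            slot = frame_logits[k * vocab_size:(k + 1) * vocab_size]
            total -= log_softmax(slot)[frame_targets[k]]
            count += 1
    return total / count


# Toy example with made-up sizes: 4 frames, 8-dim features,
# 2 audio tokens per frame, audio vocabulary of 5 entries.
random.seed(0)
T, D, K, V = 4, 8, 2, 5
features = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]
audio_tokens = [[random.randrange(V) for _ in range(K)] for _ in range(T)]

# Zero-initialized projection: every token is equally likely.
weight = [[0.0] * D for _ in range(K * V)]
bias = [0.0] * (K * V)
uniform_loss = frame_level_audio_loss(features, weight, bias, audio_tokens, K, V)

# Bias that favors the first frame's targets in every slot lowers its share of the loss.
hinted_bias = list(bias)
for k, target in enumerate(audio_tokens[0]):
    hinted_bias[k * V + target] = 2.0
hinted_loss = frame_level_audio_loss(features[:1], weight, hinted_bias,
                                     audio_tokens[:1], K, V)
```

With the zero-initialized projection the predicted distribution is uniform, so `uniform_loss` equals `log(V)`. In a real system a loss of this kind would presumably be added to the main recognition objective and computed in the same forward pass, but the abstract does not spell out the weighting or the audio tokenizer, so those details are left out here.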

