SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization

Ahn, Young Jin; Park, Jungwoo; Park, Sangha; Choi, Jonghyun; Kim, Kee-Eung
Abstract

Visual Speech Recognition (VSR) stands at the intersection of computer vision and speech recognition, aiming to interpret spoken content from visual cues. A prominent challenge in VSR is the presence of homophenes, visually similar lip gestures that represent different phonemes. Prior approaches have sought to distinguish fine-grained visemes by aligning visual and auditory semantics, but often fell short of full synchronization. To address this, we present SyncVSR, an end-to-end learning framework that leverages quantized audio for frame-level crossmodal supervision. By integrating a projection layer that synchronizes visual representation with acoustic data, our encoder learns to generate discrete audio tokens from a video sequence in a non-autoregressive manner. SyncVSR shows versatility across tasks, languages, and modalities at the cost of a forward pass. Our empirical evaluations show that it not only achieves state-of-the-art results but also reduces data usage by up to ninefold.
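
The frame-level supervision described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the shapes, the number of audio tokens per video frame (`K`), and the use of a single linear projection followed by cross-entropy are all assumptions made for illustration. The sketch only shows the general idea of projecting each frame's visual feature to logits over a discrete audio vocabulary and predicting all tokens in one pass, with no token conditioned on a previously predicted token.

```python
import numpy as np


def _project(visual_feats, proj_weight, proj_bias, tokens_per_frame, vocab_size):
    """Map (T, D) visual features to (T, K, V) logits with one linear layer."""
    num_frames = visual_feats.shape[0]
    logits = visual_feats @ proj_weight + proj_bias  # (T, K * V)
    return logits.reshape(num_frames, tokens_per_frame, vocab_size)


def audio_token_sync_loss(visual_feats, proj_weight, proj_bias, audio_tokens, vocab_size):
    """Mean cross-entropy between projected frames and quantized audio tokens.

    All argument names and shapes are hypothetical:
      visual_feats: (T, D) encoder outputs, one row per video frame
      proj_weight:  (D, K * V) projection weights
      proj_bias:    (K * V,) projection bias
      audio_tokens: (T, K) integer ids from some audio quantizer
    """
    tokens_per_frame = audio_tokens.shape[1]
    logits = _project(visual_feats, proj_weight, proj_bias, tokens_per_frame, vocab_size)
    # Numerically stable log-softmax over the vocabulary axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the target token at every (frame, slot).
    picked = np.take_along_axis(log_probs, audio_tokens[..., None], axis=-1)
    return float(-picked.mean())


def predict_audio_tokens(visual_feats, proj_weight, proj_bias, tokens_per_frame, vocab_size):
    """Predict all audio tokens at once (non-autoregressive argmax decoding)."""
    logits = _project(visual_feats, proj_weight, proj_bias, tokens_per_frame, vocab_size)
    return logits.argmax(axis=-1)  # (T, K)
```

In a training setup along these lines, such a loss would presumably be added to the main recognition objective, so the extra supervision costs only the projection and one forward pass; how the two terms are weighted is not stated in the abstract.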