Conformers are All You Need for Visual Speech Recognition
Oscar Chang Hank Liao Dmitriy Serdyuk Ankit Shah† Olivier Siohan
Abstract
Visual speech recognition models extract visual features in a hierarchical manner. At the lower level, there is a visual front-end with a limited temporal receptive field that processes the raw pixels depicting the lips or faces. At the higher level, there is an encoder that attends to the embeddings produced by the front-end over a large temporal receptive field. Previous work has focused on improving the visual front-end of the model to extract more useful features for speech recognition. Surprisingly, our work shows that complex visual front-ends are not necessary. Instead of allocating resources to a sophisticated visual front-end, we find that a linear visual front-end paired with a larger Conformer encoder results in lower latency, more efficient memory usage, and improved WER performance. We achieve a new state-of-the-art of 12.8% WER for visual speech recognition on the TED LRS3 dataset, which rivals the performance of audio-only models from just four years ago.
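To make the architectural claim concrete, the following is a minimal sketch of the proposed layout: flattened lip/face pixels passed through a single linear projection (the linear visual front-end), followed by a stack of Conformer blocks as the encoder. The paper's abstract does not specify implementation details, so the model dimension, number of heads, kernel size, depth, and the simplified Conformer block (no relative positional encoding or batch norm in the conv module) are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn


class LinearFrontEnd(nn.Module):
    """Linear visual front-end: flattened frame pixels -> single linear projection."""
    def __init__(self, frame_dim: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(frame_dim, d_model)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, frame_dim) flattened lip/face crops
        return self.proj(frames)


class ConformerBlock(nn.Module):
    """Simplified Conformer block: half-step FFN -> self-attention -> conv module -> half-step FFN."""
    def __init__(self, d_model: int, n_heads: int = 4, kernel_size: int = 31):
        super().__init__()
        self.ff1 = self._ffn(d_model)
        self.norm_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.pointwise2 = nn.Conv1d(d_model, d_model, 1)
        self.ff2 = self._ffn(d_model)
        self.norm_out = nn.LayerNorm(d_model)

    @staticmethod
    def _ffn(d_model: int) -> nn.Sequential:
        return nn.Sequential(nn.LayerNorm(d_model),
                             nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                             nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + 0.5 * self.ff1(x)                        # half-step feed-forward
        a = self.norm_attn(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.norm_conv(x).transpose(1, 2)            # (batch, d_model, time)
        c = nn.functional.glu(self.pointwise1(c), dim=1)
        c = self.pointwise2(nn.functional.silu(self.depthwise(c)))
        x = x + c.transpose(1, 2)
        x = x + 0.5 * self.ff2(x)
        return self.norm_out(x)


# Toy usage: 16 frames of 64x64 grayscale lip crops -> encoder features.
frontend = LinearFrontEnd(frame_dim=64 * 64, d_model=256)
encoder = nn.Sequential(*[ConformerBlock(256) for _ in range(4)])
frames = torch.randn(2, 16, 64 * 64)
features = encoder(frontend(frames))  # (2, 16, 256)
```

The design point the abstract makes is that the hierarchical feature extraction normally done by a heavy convolutional front-end can instead be absorbed by a larger Conformer encoder; in this sketch that simply means `LinearFrontEnd` does nothing beyond a projection, while capacity lives in the stacked `ConformerBlock` layers.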