8 months ago

Audio and Speech Processing

Audio Classification

Audio Recognition

Desh Raj, Daniel Povey, Sanjeev Khudanpur

Abstract

Guided source separation (GSS) is a type of target-speaker extraction method that relies on pre-computed speaker activities and blind source separation to perform front-end enhancement of overlapped speech signals. It was first proposed during the CHiME-5 challenge and provided significant improvements over the delay-and-sum beamforming baseline. Despite its strengths, however, the method has seen limited adoption for meeting transcription benchmarks primarily due to its high computation time. In this paper, we describe our improved implementation of GSS that leverages the power of modern GPU-based pipelines, including batched processing of frequencies and segments, to provide 300x speed-up over CPU-based inference. The improved inference time allows us to perform detailed ablation studies over several parameters of the GSS algorithm -- such as context duration, number of channels, and noise class, to name a few. We provide end-to-end reproducible pipelines for speaker-attributed transcription of popular meeting benchmarks: LibriCSS, AMI, and AliMeeting. Our code and recipes are publicly available: https://github.com/desh2608/gss.
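
To illustrate what "batched processing of frequencies" can mean in practice, here is a minimal sketch, not the authors' implementation. It computes mask-weighted spatial covariance matrices, a common building block of mask-based beamforming, once with a per-frequency loop and once as a single batched operation. The function names, array shapes, and the use of NumPy on CPU are assumptions made for illustration; the same array expression could be handed to a GPU array library.

```python
import numpy as np


def masked_covariance_loop(stft, mask):
    """Per-frequency loop (hypothetical baseline).

    stft: complex array of shape (F, C, T) = (frequencies, channels, frames)
    mask: real array of shape (F, T), e.g. a speaker-activity-derived weight
    returns: complex array of shape (F, C, C)
    """
    num_freqs, num_channels, _ = stft.shape
    out = np.zeros((num_freqs, num_channels, num_channels), dtype=complex)
    for f in range(num_freqs):
        weighted = stft[f] * mask[f]                 # (C, T), mask broadcast over channels
        norm = max(mask[f].sum(), 1e-10)             # avoid division by zero
        out[f] = weighted @ stft[f].conj().T / norm  # (C, C)
    return out


def masked_covariance_batched(stft, mask):
    """Same computation for all frequencies in one batched call."""
    norm = np.maximum(mask.sum(axis=-1), 1e-10)      # (F,)
    cov = np.einsum("fct,ft,fdt->fcd", stft, mask, stft.conj())
    return cov / norm[:, None, None]


# Small random example: 5 frequency bins, 3 channels, 20 frames.
rng = np.random.default_rng(0)
stft = rng.standard_normal((5, 3, 20)) + 1j * rng.standard_normal((5, 3, 20))
mask = rng.uniform(0.0, 1.0, size=(5, 20))

cov_loop = masked_covariance_loop(stft, mask)
cov_batched = masked_covariance_batched(stft, mask)
```

Both functions return the same matrices; the batched form simply removes the Python-level loop so that the array backend can process all frequency bins at once.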

GPU-accelerated Guided Source Separation for Meeting Transcription