LongSpeech: A Scalable Benchmark for Transcription, Translation, and Understanding of Long-Form Audio

Fei Yang, Xuanfan Ni, Renyi Yang, Jiahui Geng, Qing Li, Chenyang Lyu, Yichao Du, Longyue Wang, Weihua Luo, Kaifu Zhang

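Abstract

Recent advances in audio-language models have demonstrated clear success on short, segment-level speech tasks. However, real-world applications such as meeting transcription, spoken document understanding, and conversational analysis require robust models that can process and reason over long-form audio. In this work, we present LongSpeech, a large-scale and scalable benchmark specifically designed to evaluate and advance the capabilities of speech models on long-duration audio. LongSpeech comprises over 100,000 speech segments, each approximately 10 minutes long, with rich annotations for ASR (automatic speech recognition), speech translation, summarization, language detection, speaker counting, content separation, and question answering. We introduce a reproducible pipeline for constructing long-form speech benchmarks from diverse sources, enabling future extensions. Initial experiments with state-of-the-art models reveal significant performance gaps: models often specialize in one task at the expense of others, or struggle with higher-level reasoning tasks. These findings underscore the difficulty of the benchmark. The benchmark will be made publicly available to the research community.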

One-sentence Summary

The authors present LongSpeech, a scalable benchmark comprising over 100,000 speech segments approximately 10 minutes long annotated for transcription, translation, and understanding, alongside a reproducible construction pipeline, with initial experiments revealing significant performance gaps in state-of-the-art models regarding higher-level reasoning for real-world applications such as meeting transcription.

Key Contributions

  • The work presents LongSpeech, a large-scale benchmark comprising over 100,000 speech segments approximately 10 minutes long with rich annotations for tasks like ASR and summarization. This dataset is designed to evaluate and advance the capabilities of speech models on long-duration audio across multiple domains.
  • A reproducible pipeline for constructing long-form speech benchmarks from diverse sources is introduced to enable future extensions. This method facilitates the replication and expansion of the benchmark construction process for subsequent studies.
  • Experiments conducted with state-of-the-art models reveal significant performance gaps, particularly in higher-level reasoning tasks such as summarization and temporal localization. These findings validate the challenging nature of the benchmark and highlight critical gaps in current models' ability to maintain context over extended audio streams.

Introduction

Advanced audio-language models face significant challenges when processing extended audio streams for real-world tasks. Existing systems often exhibit a trade-off between core functions like transcription and higher-level reasoning tasks such as summarization or temporal localization. To address these deficiencies, the authors introduce LongSpeech. This large-scale benchmark serves as a scalable evaluation platform designed to test and improve transcription, translation, and understanding capabilities in long-form speech.

Dataset

  • Dataset Composition and Sources

    • The authors construct LongSpeech using over 100,000 speech segments, each approximately 10 minutes long.
    • Sources include LibriSpeech, TED-LIUM v3, SPGISpeech, VoxPopuli, CommonVoice, AISHELL-2, IWSLT, and a custom movie dialogue corpus.
    • These corpora cover diverse domains, speakers, and languages under research-permissive licenses.
  • Curation and Processing Details

    • LibriSpeech and SPGISpeech data are grouped by speaker and chapter, then concatenated sequentially until each segment reaches roughly 600 seconds.
    • CommonVoice segments utilize embedding-based selection with FAISS clustering to group semantically similar content.
    • VoxPopuli and AISHELL-2 prioritize supervised multi-speaker segments while filtering out short utterances.
    • The movie corpus uses text-to-speech synthesis to ensure diverse speaker and gender distributions.
    • Ground-truth transcriptions come from original datasets or high-quality generation models.
    • Metadata for speaker counting and language detection is inferred from corpus-level annotations.
  • Model Training and Splits

    • The benchmark evaluates eight tasks including ASR, translation, summarization, and question answering.
    • Data is partitioned into train, dev, and test sets following a 7:1.5:1.5 ratio.
    • Final splits aggregate examples from all tasks to ensure comprehensive representation.
    • The total set contains 142,200 training examples, 30,100 development examples, and 30,100 test examples.
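
The grouping and split steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `Utterance` fields, the greedy cut-off rule, and the choice to drop leftovers shorter than the target are all assumptions made for the example. Only the 600-second target and the reported split sizes come from the text.

```python
from collections import defaultdict
from dataclasses import dataclass

TARGET_SECONDS = 600.0  # "roughly 600 seconds" per the text


@dataclass
class Utterance:
    # Hypothetical field names, chosen for illustration only.
    speaker: str
    chapter: str
    index: int        # position of the utterance within its chapter
    duration: float   # length in seconds


def build_long_segments(utterances, target=TARGET_SECONDS):
    """Group by (speaker, chapter), then concatenate in order up to ~target seconds."""
    groups = defaultdict(list)
    for utt in utterances:
        groups[(utt.speaker, utt.chapter)].append(utt)

    segments = []
    for key in sorted(groups):
        current, total = [], 0.0
        for utt in sorted(groups[key], key=lambda u: u.index):
            current.append(utt)
            total += utt.duration
            if total >= target:
                segments.append(current)
                current, total = [], 0.0
        # Assumption: a leftover shorter than the target is discarded.
    return segments


def split_fractions(train, dev, test):
    """Return each split's share of the total, rounded to three decimals."""
    total = train + dev + test
    return tuple(round(n / total, 3) for n in (train, dev, test))


if __name__ == "__main__":
    # Reported split sizes from the text: 142,200 / 30,100 / 30,100.
    print(split_fractions(142_200, 30_100, 30_100))
```

With the reported sizes, the three splits sum to 202,400 examples and the fractions come out at about 0.703, 0.149, and 0.149, which is close to the stated 7:1.5:1.5 ratio.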

Experiment

The study evaluates multiple foundation audio-language models using standard metrics across core speech tasks and higher-level understanding benchmarks. Results reveal significant performance gaps where models demonstrate specialization in areas like translation but struggle with long-form processing and deep semantic reasoning. Notably, systems frequently parse user intent correctly yet lack the precision to extract accurate information or track temporal issues, underscoring the necessity of the LongSpeech benchmark for identifying current limitations.

The table evaluates the performance of various audio-language models on speech recognition and translation tasks, revealing distinct specializations rather than uniform superiority. Whisper achieves the lowest error rates for recognition but does not support translation, whereas Voxtral delivers the strongest speech-to-text translation performance among the evaluated models. AudioFlamingo3 shows the weakest performance, with the highest error rates in both recognition and translation.
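
The summary does not name the recognition metric behind these error rates. Assuming it is the usual word error rate (WER), a minimal sketch of the computation looks like this; the function name and whitespace tokenization are illustrative choices, and real evaluations typically normalize text (casing, punctuation) first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the reference length, a 10-minute transcript with thousands of words yields a stable rate, but a model that truncates its output early is penalized with one deletion per missing word.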

The authors evaluate audio-language models on tasks ranging from content separation to temporal localization, revealing significant limitations in current long-form speech processing. While models generally excel at parsing query intent, they struggle to generate accurate answers, particularly in reasoning-heavy tasks like temporal localization. Voxtral emerges as the strongest performer overall, achieving the highest scores in emotion analysis, summarization, and temporal localization. DashengLM demonstrates high parsability rates in speaker counting but fails to provide correct numeric answers. Language detection is the only task where AudioFlamingo3 significantly outperforms the other models.
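
The gap between parsability and correctness in speaker counting can be made concrete with a small scoring sketch. The parsing rule here (take the first integer found in the response) and the function names are assumptions for illustration; the summary does not describe how the authors extract or score answers.

```python
import re


def parse_count(response: str):
    """Return the first integer in a free-text response, or None if there is none."""
    match = re.search(r"\d+", response)
    return int(match.group()) if match else None


def score_speaker_counting(responses, gold_counts):
    """Report how often an answer is parsable and how often it is correct."""
    if len(responses) != len(gold_counts):
        raise ValueError("responses and gold_counts must have the same length")
    parsed = [parse_count(r) for r in responses]
    n = len(gold_counts)
    parsable = sum(p is not None for p in parsed)
    correct = sum(p == g for p, g in zip(parsed, gold_counts))
    return {"parsable_rate": parsable / n, "accuracy": correct / n}


if __name__ == "__main__":
    responses = ["There are 3 speakers.", "2", "I cannot tell.", "5 people"]
    gold = [3, 4, 2, 2]
    print(score_speaker_counting(responses, gold))
```

In this toy input, three of four responses contain a number but only one matches the reference, which mirrors the pattern described above: a high parsable rate does not imply a correct count.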

These experiments evaluate audio-language models on speech recognition, translation, and long-form processing tasks, revealing distinct specializations rather than uniform superiority. Voxtral emerges as the strongest overall performer, particularly in translation and reasoning tasks, whereas Whisper achieves the most accurate recognition but lacks translation functionality. While current models excel at parsing query intent, they exhibit significant limitations in generating accurate answers for complex reasoning, with AudioFlamingo3 showing the weakest performance despite its strength in language detection.

