Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities

Abstract

Granite-speech LLMs are compact and efficient speech language models specifically designed for English ASR (automatic speech recognition) and AST (automatic speech translation). The models were trained by modality alignment, adapting the 2B and 8B parameter variants of granite-3.3-instruct to speech using publicly available open-source corpora containing audio inputs and text targets, the latter consisting of either human transcripts for ASR or automatically generated translations for AST. Comprehensive benchmarks show that on English ASR, which was our primary focus, they outperform several competing models trained on orders of magnitude more proprietary data. They also remain competitive on English-to-X AST for the major European languages, Japanese, and Chinese.

The speech-specific components are: a Conformer acoustic encoder using block attention and self-conditioning, trained with connectionist temporal classification (CTC); a windowed query-transformer speech modality adapter used to temporally downsample the acoustic embeddings and project them into the text embedding space of the LLM; and LoRA adapters to further fine-tune the text LLM.

Granite-speech-3.3 operates in two modes: in speech mode, it performs ASR and AST by activating the encoder, projector, and LoRA adapters; in text mode, it calls the underlying granite-3.3-instruct model directly (without LoRA), thereby preserving all of the text LLM's capabilities and safety. Both models are freely available on HuggingFace and can be used for both research and commercial purposes under a permissive Apache 2.0 license.

One-sentence Summary

By integrating a Conformer acoustic encoder with block attention and self-conditioning, a windowed query-transformer speech modality adapter, and LoRA fine-tuning, the Granite-speech series of compact, speech-aware LLMs achieves efficient English ASR and automatic speech translation, outperforming several competing models trained on far more proprietary data while preserving the original text-based capabilities of the Granite-3.3-instruct variants.

Key Contributions

  • The paper introduces Granite-speech, a family of compact speech-aware large language models in 2B and 8B parameter variants designed for English automatic speech recognition (ASR) and automatic speech translation (AST).
  • The architecture utilizes a specific speech-modality alignment strategy consisting of a Conformer acoustic encoder with block attention, a windowed query-transformer speech modality adapter for temporal downsampling, and LoRA adapters to fine-tune the underlying Granite-3.3-instruct model.
  • Experimental results demonstrate that these models outperform several competitors trained on significantly larger proprietary datasets in English ASR and maintain competitive performance in English-to-X translation for major European languages, Japanese, and Chinese.

Introduction

Modern spoken language models generally fall into two categories: early fusion models that integrate audio and text tokens directly, and speech-aware LLMs that use an acoustic encoder to map audio into the input space of a text-based LLM. While early fusion models offer high modality fluency, they often suffer from reduced instruction-following capabilities and increased safety risks due to limited text-based alignment. The authors adopt a speech-aware architecture to develop Granite-speech, a series of compact 2B and 8B parameter models designed for English automatic speech recognition and automatic speech translation. By using a Conformer acoustic encoder and a windowed query-transformer adapter to align audio with the Granite-3.3-instruct backbone, the authors preserve the original text model's safety guardrails and reasoning capabilities while achieving competitive performance on English ASR tasks.

Dataset

  • Dataset Composition and Sources: The authors train their models on a combination of major publicly available English Automatic Speech Recognition (ASR) datasets and synthetic speech translation data. The ASR corpora include Multilingual LibriSpeech, Gigaspeech, CommonVoice 17.0, LibriSpeech, Voxpopuli, AMI, YODAS, SPGI Speech, Switchboard, CallHome, Fisher, Voicemail, and TED LIUM.

  • Synthetic Speech Translation Data: To support speech translation tasks, the authors generated synthetic data by translating English transcriptions from CommonVoice 17 into several languages, including French, Spanish, German, Italian, Portuguese, Japanese, and Chinese.

  • Data Processing and Filtering: The authors used an ensemble filtering strategy to ensure high quality in the synthetic translations. They employed Phi-4 as the primary translation model and MADLAD-3B/10B as the secondary model, and measured the similarity between the two models' translation outputs. After testing various metrics, they selected Word Error Rate (WER) and Character Error Rate (CER) as the most effective similarity measures for filtering. Specifically, they applied a WER threshold of 0.3 for English to German translations and a CER threshold of 0.4 for English to Japanese translations. This process retained less than half of the original CommonVoice data but ensured higher translation reliability.

  • Model Usage: The processed datasets are used to train the Granite-speech-3.3 models (both 2B and 8B parameter versions). The mixture of ASR and synthetic translation data allows the models to perform both speech recognition and speech translation tasks, with the 8B model demonstrating superior translation performance compared to the 2B variant.
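The ensemble filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the primary model's translation is treated as the reference, that a pair is kept when the error rate is at or below the threshold, and the function names (`edit_distance`, `error_rate`, `keep_translation`) are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(primary, secondary, unit="word"):
    """WER (unit="word") or CER (unit="char") of secondary against primary.

    Assumption: the primary translation plays the role of the reference.
    """
    ref = primary.split() if unit == "word" else list(primary)
    hyp = secondary.split() if unit == "word" else list(secondary)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def keep_translation(primary, secondary, unit, threshold):
    """Keep a sample only if the two translations agree closely enough."""
    return error_rate(primary, secondary, unit) <= threshold


# Thresholds from the text: WER 0.3 for German, CER 0.4 for Japanese.
keep_de = keep_translation("das ist ein kleiner Test", "das ist ein Test", "word", 0.3)
```

Character-level scoring is the natural choice for Japanese because the text is not whitespace-delimited, which is consistent with the use of CER for that language pair.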

Method

The Granite-speech system is designed as a speech-aware large language model (LLM) capable of performing both automatic speech recognition (ASR) and automatic speech translation (AST). The architecture integrates several key components to bridge the gap between continuous acoustic signals and discrete text tokens.

The overall framework consists of an acoustic encoder, a speech modality adapter, and a Granite text LLM. The acoustic encoder converts the raw speech signal into high-level representations. These representations are then processed by the speech modality adapter, which serves as a temporal downsampler and maps the acoustic embeddings into a latent space interpretable by the text LLM. To adapt the LLM to the specific characteristics of these acoustic embeddings, the authors employ LoRA (Low-Rank Adaptation) adapters applied to the query and value projection matrices within the attention blocks of the LLM layers.
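To make the LoRA component concrete, the sketch below shows a low-rank update on a single frozen projection matrix, as would be applied to a query or value projection. It is an illustration under assumptions: the class name `LoRALinear`, the `enabled` switch, and the rank and scaling values are hypothetical and not taken from the paper.

```python
import numpy as np


class LoRALinear:
    """Frozen linear projection with an optional low-rank (LoRA) update."""

    def __init__(self, weight, rank, alpha, rng):
        d_out, d_in = weight.shape
        self.weight = weight                            # frozen base weight
        self.A = rng.normal(0.0, 0.01, (rank, d_in))    # trainable down-projection
        self.B = np.zeros((d_out, rank))                # trainable up-projection, zero init
        self.scale = alpha / rank
        self.enabled = True                             # hypothetical on/off switch

    def __call__(self, x):
        y = x @ self.weight.T
        if self.enabled:
            # Equivalent to using the weight W + scale * (B @ A).
            y = y + self.scale * (x @ self.A.T) @ self.B.T
        return y


rng = np.random.default_rng(0)
W_q = rng.normal(size=(4, 6))             # stand-in for a query projection
q_proj = LoRALinear(W_q, rank=2, alpha=4.0, rng=rng)
x = rng.normal(size=(3, 6))
out_with_lora = q_proj(x)
```

Because the base weight is never modified, switching the update off recovers the original text model exactly, which is how a text mode without LoRA can preserve the underlying model's behavior.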

Refer to the framework diagram:

The speech modality adapter is a two-layer window-level Q-former projector. The design is inspired by the SALMONN architecture: a fixed number of trainable queries attend to the acoustic embeddings, converting a variable-length acoustic sequence into a shorter sequence of query outputs. Given an acoustic embedding sequence $\mathbf{X} = \mathbf{x}_1 \ldots \mathbf{x}_T$ of length $T$ and $N$ trainable queries $\mathbf{Q} = \mathbf{q}_1 \ldots \mathbf{q}_N$, the adapter processes the input in blocks of size $K$ (where $K \geq N$ and $K \bmod N = 0$). The transformation is defined as:

$$\mathbf{y}_{(i-1)N+1} \ldots \mathbf{y}_{iN} = \text{Q-former}(\mathbf{Q}, \mathbf{x}_{(i-1)K+1} \ldots \mathbf{x}_{iK}), \quad i = 1 \ldots \lceil T/K \rceil$$

This mechanism performs temporal downsampling by a factor of $K/N$. In the best configuration identified by the authors, a block size of $K=15$ frames with $N=3$ queries gives an adapter downsampling factor of 5; the overall reduction from the original 100 Hz log-mel frame rate to a 10 Hz rate for the LLM implies an additional factor of 2 upstream of the adapter.
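The block-wise structure of the adapter can be sketched with a simplified stand-in. The function below replaces the two-layer Q-former with a single-head cross-attention that has no learned projections, so it only illustrates the windowing and the resulting output length; the name `windowed_cross_attention` and the masking of padded frames in the last block are assumptions.

```python
import math
import numpy as np


def windowed_cross_attention(X, Q, K):
    """Apply N queries to each block of K frames.

    X: (T, d) acoustic embeddings, Q: (N, d) queries, K: block size.
    Returns an array of shape (ceil(T / K) * N, d).
    """
    T, d = X.shape
    N = Q.shape[0]
    assert K >= N and K % N == 0
    num_blocks = math.ceil(T / K)

    # Zero-pad the last block and remember which frames are real.
    pad = num_blocks * K - T
    X_pad = np.concatenate([X, np.zeros((pad, d))], axis=0)
    blocks = X_pad.reshape(num_blocks, K, d)
    valid = (np.arange(num_blocks * K) < T).reshape(num_blocks, 1, K)

    # Scaled dot-product attention, restricted to frames of the same block.
    scores = np.einsum("nd,bkd->bnk", Q, blocks) / math.sqrt(d)
    scores = np.where(valid, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    Y = np.einsum("bnk,bkd->bnd", weights, blocks)
    return Y.reshape(num_blocks * N, d)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))   # 100 frames of 8-dim embeddings (toy sizes)
Q = rng.normal(size=(3, 8))     # N = 3 queries
Y = windowed_cross_attention(X, Q, K=15)
```

With $T=100$, $K=15$ and $N=3$ there are $\lceil 100/15 \rceil = 7$ blocks and therefore 21 output vectors, and each group of 3 outputs depends only on the frames of its own block.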

To handle different tasks, the authors implement a task-specific prompt construction method using the Granite chat formatting syntax. The input sequence includes a system prompt, a user query, and a model response. For ASR and AST tasks, the user query contains a special $\langle\text{audio}\rangle$ token. During the forward pass, this token is replaced by the projected embeddings from the Q-former. For AST, the model supports both direct translation and a chain-of-thought (CoT) approach, where the model is prompted to first transcribe the speech and then translate it, using explicit tags to separate the steps.
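The placeholder replacement can be illustrated as a simple splice at the embedding level. This is a sketch under assumptions: the function name, the token ids, and the toy embedding table are hypothetical, and the real model's tokenizer and chat template are not reproduced here.

```python
import numpy as np


def splice_audio_embeddings(token_ids, text_embedding_table, audio_embeddings, audio_token_id):
    """Replace the single audio placeholder token with projected audio embeddings.

    token_ids: prompt token ids containing exactly one placeholder.
    text_embedding_table: (vocab_size, d) text embedding matrix.
    audio_embeddings: (M, d) output of the speech modality adapter.
    """
    ids = np.asarray(token_ids, dtype=np.int64)
    positions = np.flatnonzero(ids == audio_token_id)
    if len(positions) != 1:
        raise ValueError("expected exactly one audio placeholder token")
    p = int(positions[0])
    before = text_embedding_table[ids[:p]]
    after = text_embedding_table[ids[p + 1:]]
    return np.concatenate([before, audio_embeddings, after], axis=0)


# Toy example: vocabulary of 10 tokens, id 9 is the hypothetical placeholder.
table = np.arange(20.0).reshape(10, 2)
audio = np.full((4, 2), -1.0)
inputs_embeds = splice_audio_embeddings([1, 2, 9, 3], table, audio, audio_token_id=9)
```

The resulting sequence is longer than the token sequence by the number of audio embeddings minus one, since one placeholder position expands into the whole projected audio segment.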

The training process jointly optimizes the Q-former and the LoRA adapters while keeping the acoustic encoder frozen. The objective is the next-token prediction cross-entropy loss. To address data imbalance across corpora, the authors use a balanced sampler. The sampling probability for corpus $i$ is controlled by a factor $\alpha \in [0,1]$ and, with $N_i$ denoting the size of corpus $i$ and $L$ the number of corpora, is calculated as:

$$\frac{N_{i}^{\alpha}}{\sum_{j=1}^{L} N_{j}^{\alpha}}$$

By setting $\alpha=0.6$, the authors flatten the natural data distribution so that smaller corpora are adequately represented during fine-tuning. At the extremes, $\alpha=1$ samples each corpus in proportion to its size, while $\alpha=0$ samples all corpora uniformly.
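The effect of the exponent can be checked with a few lines of code. The corpus sizes below are made up for illustration, and the function name `sampling_probabilities` is hypothetical.

```python
def sampling_probabilities(corpus_sizes, alpha):
    """Per-corpus sampling probability N_i**alpha / sum_j N_j**alpha."""
    weights = [n ** alpha for n in corpus_sizes]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical corpus sizes: one large corpus and one small corpus.
sizes = [1000, 10]
natural = sampling_probabilities(sizes, alpha=1.0)    # proportional to size
flattened = sampling_probabilities(sizes, alpha=0.6)  # value used in the text
uniform = sampling_probabilities(sizes, alpha=0.0)    # every corpus equally likely
```

In this toy setting the small corpus is drawn more often with the exponent 0.6 than under size-proportional sampling, while still being drawn less often than the large corpus.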

Experiment

The researchers evaluated the encoder architecture by comparing different tokenization methods and model scales to optimize performance for joint LLM training. Their findings indicate that character-level tokenization is most effective for subsequent integration with large language models. Additionally, safety assessments demonstrate that the speech interface successfully maintains the refusal behaviors of the underlying text model, preventing the execution of harmful instructions even when presented with complex or noisy audio inputs.

The authors evaluate how different output tokenization methods affect CTC speech encoders, both under greedy decoding and after joint LLM training. Character-based tokenization yields better performance after LLM integration than BERT or Granite tokenization, and this holds across the evaluated datasets. Joint LLM training reduces error rates relative to greedy decoding alone for every tokenization type tested, with consistent gains across multiple audio corpora.
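For reference, greedy decoding of a CTC encoder with character outputs takes the most likely label per frame, merges consecutive repeats, and drops blanks. The sketch below assumes the per-frame argmax has already been computed; the function name, the blank id of 0, and the toy vocabulary are hypothetical.

```python
def ctc_greedy_decode(frame_label_ids, id_to_char, blank_id=0):
    """Collapse repeated labels, then remove blanks (standard CTC greedy rule)."""
    chars = []
    prev = None
    for label in frame_label_ids:
        if label != prev and label != blank_id:
            chars.append(id_to_char[label])
        prev = label
    return "".join(chars)


# Toy character vocabulary; id 0 is the CTC blank.
vocab = {1: "c", 2: "a", 3: "t"}
text = ctc_greedy_decode([0, 1, 1, 0, 2, 2, 2, 0, 3], vocab)
```

A blank between two identical labels keeps them as separate characters, which is why the blank symbol is needed to represent doubled letters.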

The authors compare the performance of different Granite large language models across several datasets. The results indicate that model size and version influence recognition accuracy across various audio corpora. The smallest model version shows slightly higher error rates in several categories compared to the larger versions. Performance trends remain relatively consistent across the different model iterations for most datasets. The AMI dataset consistently shows higher error rates than the other tested corpora.

The authors evaluate different projector architectures across several datasets to assess their performance. The results show that varying the number of projection heads or using an MLP yields similar error rates across most corpora. The performance remains relatively stable across different configurations of the QF projector. The x-attn projector tends to result in higher error rates compared to the other evaluated architectures. MLP and QF projectors show comparable performance trends across the majority of the tested datasets.

The authors compare the automatic speech recognition performance of two encoder architectures across various datasets. Increasing the number of layers generally improves recognition accuracy: the 16-layer encoder achieves lower error rates than the 10-layer encoder in several categories, and the improvement is observed on most of the evaluated datasets. For both encoder configurations, error rates vary with the specific corpus used.

The authors evaluate the impact of tokenization methods, LLM model scales, projector architectures, and encoder depths on speech recognition performance across various datasets. The findings indicate that character-based tokenization combined with joint LLM training yields superior results, while larger model sizes and deeper encoder architectures consistently improve accuracy. Additionally, the study demonstrates that MLP and QF projectors offer stable performance across different configurations, whereas the x-attn architecture tends to result in higher error rates.

