1 year ago

Harveen Singh Chadha Aswin Shanmugam Subramanian Vikas Joshi Shubham Bansal Jian Xue Rupeshkumar Mehta Jinyu Li

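Abstract

In video dubbing, synchronizing translated audio with the source audio is a significant challenge. This work focuses on achieving that synchronization efficiently, tailored to real-time, on-device video dubbing scenarios. We developed a phoneme-based end-to-end length-sensitive speech translation (LSST) model that uses predefined tags to generate translations of three different lengths: short, normal, and long. We also introduce length-aware beam search (LABS), an efficient method that generates translations of different lengths in a single decoding pass. Compared with a baseline without length awareness, this approach maintains comparable BLEU scores while significantly improving synchronization quality between source and target audio, with mean opinion score (MOS) gains of 0.34 for Spanish and 0.65 for Korean.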

One-sentence Summary

The authors propose a phoneme-based end-to-end length-sensitive speech translation (LSST) model for real-time, on-device video dubbing that employs predefined length tags and a length-aware beam search (LABS) decoder to generate short, normal, and long translations in a single pass, maintaining comparable BLEU scores to a baseline without length awareness while achieving mean opinion score synchronization gains of 0.34 for Spanish and 0.65 for Korean.

Key Contributions

A phoneme-based Length-Sensitive Speech Translation (LSST) model generates short, normal, and long candidate translations using predefined length tags. This phoneme-based length ratio method provides a consistent and scalable representation for cross-lingual duration modeling.
A Length-Aware Beam Search (LABS) decoding strategy produces varying-length translations within a single decoding pass. This approach eliminates the computational overhead of multiple decoding iterations, significantly reducing latency and system complexity for real-time on-device deployment.
The framework maintains comparable BLEU scores to standard baselines while substantially improving audio synchronization. The method achieves mean opinion score gains of 0.34 for Spanish and 0.65 for Korean, confirming its effectiveness for temporally aligned video dubbing.

Introduction

End-to-end speech-to-text translation has become essential for automatic video dubbing, a technology that significantly improves global content accessibility. Successful dubbing demands precise temporal alignment between source and target audio, but speech duration naturally varies across languages, often causing mismatched pacing and unnatural delivery. Prior length-control methods typically rely on text-to-text translation, character-based modeling that lacks cross-lingual consistency, or multi-pass decoding strategies that introduce unacceptable latency for real-time, on-device deployment. To overcome these hurdles, the authors introduce a Length-Sensitive Speech Translation framework paired with Length-Aware Beam Search, which efficiently generates short, normal, and long translation candidates in a single decoding pass using phoneme-based length ratios to ensure accurate and fluid temporal alignment.

Method

The authors leverage a phoneme-based end-to-end length-sensitive speech translation (LSST) model designed to generate translations of varying lengths—short, normal, and long—tailored for real-time, on-device video dubbing scenarios. The core of the approach lies in conditioning the translation process on predefined length control tokens: <short>, <normal>, and <long>. These tokens are prepended to each target translation during training, replacing the standard Start of Sequence (SOS) token. The assignment of a length tag is determined by the ratio of the target-to-source text length, computed using phoneme counts to ensure cross-lingual consistency. Specifically, the length tag $\ell$ is assigned as follows:

\ell = \begin{cases} \langle \text{short} \rangle & \text{if } r < 1 - \alpha \\ \langle \text{normal} \rangle & \text{if } 1 - \alpha \leq r \leq 1 + \alpha \\ \langle \text{long} \rangle & \text{if } r > 1 + \alpha \end{cases}

where $r$ denotes the target-to-source length ratio, measured in phonemes, and $\alpha = 0.1$. Phoneme-based length computation is preferred over character-based methods, particularly for language pairs with different writing systems such as English and Korean, because structural differences in orthography make character counts less reliable for duration estimation.
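The tagging rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `length_ratio` and `assign_length_tag` are hypothetical, and the phoneme counts are assumed to come from some external grapheme-to-phoneme step that the text does not specify.

```python
def length_ratio(source_phonemes: int, target_phonemes: int) -> float:
    """Target-to-source length ratio r, measured in phoneme counts."""
    if source_phonemes <= 0:
        raise ValueError("source_phonemes must be positive")
    return target_phonemes / source_phonemes


def assign_length_tag(source_phonemes: int, target_phonemes: int,
                      alpha: float = 0.1) -> str:
    """Map the ratio r to a length tag using the threshold alpha."""
    r = length_ratio(source_phonemes, target_phonemes)
    if r < 1 - alpha:
        return "<short>"
    if r > 1 + alpha:
        return "<long>"
    return "<normal>"  # 1 - alpha <= r <= 1 + alpha


# Hypothetical phoneme counts for one training pair.
print(assign_length_tag(source_phonemes=20, target_phonemes=15))
print(assign_length_tag(source_phonemes=20, target_phonemes=20))
print(assign_length_tag(source_phonemes=20, target_phonemes=30))
```

With 20 source phonemes, targets of 15, 20, and 30 phonemes give ratios of 0.75, 1.0, and 1.5, so the three calls fall into the short, normal, and long buckets respectively.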

During inference, the LSST model can generate multiple length variants by conditioning on the respective length token. However, generating all three variants independently is computationally expensive. To address this, the authors introduce the length-aware beam search (LABS) algorithm, which enables the efficient generation of all length variants in a single decoding pass.

The LABS algorithm modifies standard beam search by initializing the beam with the length-specific tokens $\mathcal{L} = \{s, n, l\}$, corresponding to <short>, <normal>, and <long>. At each time step $t$, the algorithm expands the beam by generating new hypotheses for each length tag $\ell \in \mathcal{L}$, extending each hypothesis $b$ in the length-specific sub-beam $B_t^{(\ell)}$ with every token $v$ in the target vocabulary $V$. The set of new hypotheses for each length tag is given by:

B_{t+1}^{(\ell)} = \{ (b \oplus v, S_{t+1}(\ell)) \mid b \in B_t^{(\ell)}, v \in V \}

where $\oplus$ denotes token appending and $S_{t+1}(\ell)$ is the updated score of the extended hypothesis, calculated as:

S_{t+1}(\ell) = S_t(\ell) + \log P(v \mid b, \ell, x)

The complete beam at time step $t+1$ is formed by the union of all length-specific sub-beams:

B_{t+1} = \bigcup_{\ell \in \mathcal{L}} B_{t+1}^{(\ell)}
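One expansion step of the equations above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `init_beam`, `expand`, `union`, and `toy_log_prob` are hypothetical, and the uniform `toy_log_prob` merely stands in for the model's $\log P(v \mid b, \ell, x)$, with the source audio $x$ left implicit. Following the formula literally, every hypothesis is extended, including any that already ended.

```python
import math
from typing import Callable, Dict, List, Tuple

# A hypothesis is (tokens, cumulative log-probability); tokens[0] is the length tag.
Hypothesis = Tuple[Tuple[str, ...], float]
LogProb = Callable[[str, Tuple[str, ...], str], float]

LENGTH_TAGS = ("<short>", "<normal>", "<long>")


def init_beam() -> Dict[str, List[Hypothesis]]:
    """Seed one sub-beam per length tag; the tag takes the place of SOS."""
    return {tag: [((tag,), 0.0)] for tag in LENGTH_TAGS}


def expand(sub_beams: Dict[str, List[Hypothesis]], vocab: List[str],
           log_prob: LogProb) -> Dict[str, List[Hypothesis]]:
    """Extend every hypothesis b in each sub-beam with every token v."""
    expanded: Dict[str, List[Hypothesis]] = {}
    for tag, hyps in sub_beams.items():
        expanded[tag] = [
            (tokens + (v,), score + log_prob(v, tokens, tag))  # S_{t+1} = S_t + log P
            for tokens, score in hyps
            for v in vocab
        ]
    return expanded


def union(sub_beams: Dict[str, List[Hypothesis]]) -> List[Hypothesis]:
    """Merge the length-specific sub-beams into the complete beam."""
    return [h for hyps in sub_beams.values() for h in hyps]


VOCAB = ["a", "b", "</s>"]  # toy vocabulary, an assumption for this example


def toy_log_prob(v: str, prefix: Tuple[str, ...], tag: str) -> float:
    """Hypothetical stand-in for the model: uniform over the toy vocabulary."""
    return math.log(1.0 / len(VOCAB))


step1 = expand(init_beam(), VOCAB, toy_log_prob)
print(len(union(step1)))  # 3 tags x 3 tokens
```

Keeping the sub-beams keyed by tag makes the later length-aware pruning straightforward, since each hypothesis carries its tag as its first token.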

Pruning is performed in a length-aware manner to maintain diversity while controlling beam size. The pruning function $\mathrm{Prune}(B_{t+1}, N, \mathcal{L})$ selects the top-$N$ hypotheses from the combined beam $B_{t+1}$, ensuring that at least one hypothesis from each length tag is preserved, if available. The selection process prioritizes high-scoring hypotheses while enforcing diversity across length categories. Specifically, the top hypothesis from each of the three length tags is included, and the remaining $N - 3$ candidates are selected based on score. If fewer than $N$ hypotheses exist across all tags, all are retained. Additionally, hypotheses that have reached the End-of-Sequence token $\langle \text{EOS} \rangle$ are given priority to ensure complete translations are considered in the final candidate set.
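A length-aware pruning step along these lines might look like the sketch below. It is an interpretation rather than the authors' code: the function name `prune`, the `</s>` spelling of the EOS token, and the way EOS priority is realized (completed hypotheses sort ahead of unfinished ones) are all assumptions. The sketch reserves one slot per length tag, which is one reading of the description, and fills the remaining slots by rank.

```python
from typing import List, Sequence, Tuple

# A hypothesis is (tokens, cumulative log-probability); tokens[0] is the length tag.
Hypothesis = Tuple[Tuple[str, ...], float]

LENGTH_TAGS = ("<short>", "<normal>", "<long>")
EOS = "</s>"  # assumed spelling of the End-of-Sequence token


def prune(beam: List[Hypothesis], n: int,
          tags: Sequence[str] = LENGTH_TAGS) -> List[Hypothesis]:
    """Keep at most n hypotheses with at least one per length tag, if available."""
    if n < len(tags):
        raise ValueError("n must be at least the number of length tags")
    if len(beam) <= n:
        return list(beam)  # fewer than n hypotheses: retain all

    # Assumed ranking: completed hypotheses first, then higher score first.
    order = sorted(range(len(beam)),
                   key=lambda i: (beam[i][0][-1] != EOS, -beam[i][1]))

    chosen: List[int] = []
    for tag in tags:  # reserve the best-ranked hypothesis of each tag
        for i in order:
            if beam[i][0][0] == tag:
                chosen.append(i)
                break
    for i in order:  # fill the remaining slots by rank
        if len(chosen) >= n:
            break
        if i not in chosen:
            chosen.append(i)

    rank = {i: pos for pos, i in enumerate(order)}
    return [beam[i] for i in sorted(chosen, key=rank.__getitem__)]


toy_beam: List[Hypothesis] = [
    (("<short>", "a"), -0.5),
    (("<short>", "b"), -0.6),
    (("<normal>", "a"), -0.7),
    (("<normal>", "b"), -2.0),
    (("<long>", "a"), -3.0),
    (("<long>", EOS), -4.0),
]
for tokens, score in prune(toy_beam, n=4):
    print(tokens, score)
```

In this toy beam, plain top-4 selection by score would drop every <long> hypothesis; the reserved slot keeps one, so all three length variants survive to the next step.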

The algorithm iterates through steps of beam expansion and pruning until a maximum length $T$ is reached or all hypotheses have generated $\langle \text{EOS} \rangle$. The final n-best list $\hat{H}$ is selected from the set of completed hypotheses $B_{\text{final}}$ using a selection function $\mathrm{SelectNBest}(B_{\text{final}}, N, \mathcal{L})$, which ensures representation from each length tag if possible, while ranking primarily by score. This approach enables efficient generation of multiple length variants in a single decoding pass, facilitating synchronization with the source audio duration without compromising translation quality.

Experiment

The evaluation setup employs a multilingual speech-to-text model trained on Spanish and Korean data, assessed via the FLEURS test set to measure translation quality and temporal alignment. The first set of experiments validates phoneme-based length tokens as an effective unit for regulating output duration while preserving translation fidelity. The second set of experiments validates the LABS decoding strategy, demonstrating substantial gains in speech rate compliance and perceived synchronization without compromising fluency or introducing significant latency. Collectively, these findings confirm that integrating length-aware decoding with phoneme-based controls produces more natural, well-timed translations that closely match source audio pacing.

The authors evaluate the length-sensitive speech translation model with character-based and phoneme-based length tokens against a baseline without length control. Both token types achieve translation quality similar to the baseline while enabling control over output length. Phoneme-based tokens perform comparably to character-based ones, which supports the use of phonemes for length-sensitive translation.

The authors also compare LABS with traditional beam search. LABS significantly improves speech rate compliance over the baseline for both Spanish and Korean. Translation quality is maintained, with only a marginal decline in BLEU for Spanish and an improvement for Korean. Synchronization quality also improves, particularly for Korean, with a minimal increase in latency.

The experiments evaluate a length-sensitive speech translation framework by comparing a standard baseline against a proposed length-aware alignment method that utilizes both character and phoneme-based tokens for output length control. These evaluations validate that both token types effectively preserve translation quality while enabling precise length management, with phoneme-based units performing comparably to character-based alternatives. Ultimately, the proposed approach significantly enhances speech rate compliance and audio-text synchronization across tested languages while maintaining minimal processing latency and robust overall translation performance.

Length Aware Speech Translation for Video Dubbing
