Fast Byte Latent Transformer

Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz, Gargi Ghosh, Luke Zettlemoyer, Christopher Potts, Xiaochuang Han, Srinivasan Iyer

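Abstract

Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, but their practical utility has been limited by slow generation that proceeds autoregressively, one byte at a time. In this work, we remove this bottleneck in the Byte Latent Transformer (BLT) by introducing new training and generation methods. First, we propose BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This approach enables an inference procedure that generates multiple bytes in parallel at each decoding step, substantially reducing the number of forward passes needed to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some inference speed for higher generation quality. The first is BLT Self-speculation (BLT-S), in which BLT's local decoder continues drafting bytes past its usual patch boundaries and the draft is then verified with a single forward pass through the full model. The second is BLT Diffusion+Verification (BLT-DV), which adds an autoregressive verification step after diffusion-based generation in BLT-D. All of these methods are estimated to reduce memory-bandwidth cost by more than 50% relative to BLT on generation tasks. Each approach has its own advantages, and together they remove key barriers to the practical use of byte-level LLMs.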

One-sentence Summary

Addressing the slow byte-by-byte autoregressive generation limiting byte-level language models, this work enhances the Byte Latent Transformer (BLT) with BLT Diffusion (BLT-D), BLT Self-speculation (BLT-S), and BLT Diffusion+Verification (BLT-DV), which utilize parallel generation and verification and may achieve an estimated memory-bandwidth cost over 50% lower than BLT while removing key barriers to the practical use of subword-free models.

Key Contributions

  • The paper introduces BLT Diffusion (BLT-D), a variant trained with an auxiliary block-wise diffusion objective alongside standard next-byte prediction to enable parallel byte generation. This design substantially reduces the number of forward passes required to generate a sequence compared to traditional autoregressive approaches.
  • BLT Self-speculation (BLT-S) uses the existing local decoder to draft bytes past normal patch boundaries, so no separate draft model is required; the drafts are then verified by the full model. This extension reduces the number of expensive encoder calls while preserving the output quality of standard autoregressive decoding.
  • BLT Diffusion+Verification (BLT-DV) combines fast diffusion drafting with an autoregressive verification step to occupy a middle point in the speed and performance trade-off. Collectively, the methods may achieve an estimated memory-bandwidth cost over 50% lower than the standard BLT on generation tasks.

Introduction

Byte-level language models operate directly on raw bytes to avoid subword tokenization issues such as noise sensitivity and multilingual disparities. Despite these benefits, prior work suffers from inefficient inference: sequential byte-by-byte generation creates a memory-bandwidth bottleneck. The authors address this by introducing BLT Diffusion, which enables parallel byte generation through a block-wise diffusion objective. They further develop BLT Self-speculation and BLT Diffusion+Verification to balance speed with quality without relying on external draft models. Together, these methods are estimated to reduce memory-bandwidth costs by over 50% and remove key barriers to practical deployment.

Dataset

  • Dataset Composition and Structure
    • The authors use raw training samples formatted as byte sequences segmented into variable-length patches.
    • The data structure consists of fixed-length blocks constructed from these patches to enable block-wise masked prediction.
  • Preprocessing and Construction Details
    • An entropy patcher dynamically segments the input to define patch boundaries.
    • Blocks are created by taking consecutive bytes starting at patch indices and often extend beyond the original patch size.
    • Special padding tokens are applied when blocks exceed the sequence length.
    • Original byte positional indices are recorded to ensure correct RoPE positional encodings in the decoder.
  • Training Usage and Masking Strategy
    • The model employs a diffusion process where a continuous timestep is sampled during training.
    • Bytes are independently replaced with [MASK] tokens based on the sampled probability to create a corrupted input.
    • This setup allows the model to reconstruct the clean sequence from the corrupted input during inference.
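
The preprocessing steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `PAD` and `MASK` ids, the function names, and the hand-written patch boundaries are all hypothetical, and the choice to leave padding slots unmasked is an assumption.

```python
import random

PAD = 256   # hypothetical id for the padding token (outside the 0..255 byte range)
MASK = 257  # hypothetical id for the [MASK] token


def build_blocks(byte_ids, patch_starts, block_size):
    """Build one fixed-length block per patch start, padding past the sequence end.

    Returns a list of (block, positions) pairs. positions keeps the original
    byte index of every slot so a decoder could reuse it for RoPE.
    """
    blocks = []
    for start in patch_starts:
        positions = list(range(start, start + block_size))
        block = [byte_ids[p] if p < len(byte_ids) else PAD for p in positions]
        blocks.append((block, positions))
    return blocks


def corrupt(block, t, rng):
    """Independently replace each real byte with MASK with probability t."""
    return [MASK if b != PAD and rng.random() < t else b for b in block]


rng = random.Random(0)
data = list(b"byte latent")      # 11 raw bytes
patch_starts = [0, 4, 9]         # assumed output of an entropy patcher
blocks = build_blocks(data, patch_starts, block_size=4)
t = rng.random()                 # continuous timestep sampled in [0, 1)
noisy = [corrupt(block, t, rng) for block, _ in blocks]
```

Note that the block starting at index 4 covers bytes 4 to 7 even though its patch ends at index 8 only by coincidence of the toy boundaries; in general a block may run past its patch, and the last block here is padded because it runs past the end of the sequence.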

Method

The Byte Latent Transformer (BLT) operates directly on raw byte sequences, using a hierarchical architecture to balance efficiency and performance. The model consists of three primary components: a local encoder $\mathcal{E}$, a global transformer $\mathcal{G}$, and a local decoder $\mathcal{D}$. The local encoder embeds the input byte sequence into initial representations, which are then processed into latent token representations by the global model. These latent tokens are subsequently decoded back into bytes by the local decoder. Refer to the framework diagram below for a visualization of this interaction between the local and global components.

To enable efficient block diffusion decoding, BLT-D introduces a specialized training pipeline. The process begins with dynamic patch segmentation, where raw training samples are split into variable-length patches based on entropy. These patches are then extended into fixed-size blocks and corrupted with [MASK] tokens to create the training input. Refer to the figure below for the step-by-step data preprocessing workflow.

During the training forward pass, the model processes both clean and corrupted inputs. The encoder and global model handle the clean input to produce latent representations, which are then used by the decoder. The decoder applies cross-attention to these latent tokens while employing specific attention masks: causal attention for the clean sequence and bidirectional attention within the corrupted blocks. The total training objective combines a next-byte prediction loss for the clean sequence with a masked diffusion loss for the corrupted blocks. The complete training architecture is illustrated below.
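
As a rough illustration of how the two loss terms might be combined, the sketch below averages a next-byte cross-entropy over the clean sequence and a cross-entropy over the masked block positions only. The function names, the per-term averaging, and the `diffusion_weight` parameter are assumptions made for this example; the summary does not specify how the terms are weighted or normalized.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target id under one predicted distribution."""
    return -math.log(probs[target])


def combined_loss(clean_probs, clean_targets, block_probs, block_targets,
                  masked, diffusion_weight=1.0):
    """Next-byte loss on the clean sequence plus a masked loss on corrupted blocks.

    masked[i] is True where the block input held [MASK]; only those positions
    contribute to the diffusion term. diffusion_weight is a hypothetical knob.
    """
    nbp = sum(cross_entropy(p, y) for p, y in zip(clean_probs, clean_targets))
    nbp /= len(clean_targets)
    terms = [cross_entropy(p, y)
             for p, y, m in zip(block_probs, block_targets, masked) if m]
    diffusion = sum(terms) / len(terms) if terms else 0.0
    return nbp + diffusion_weight * diffusion


# Toy example with a two-symbol vocabulary.
clean_probs = [[0.5, 0.5], [0.25, 0.75]]
clean_targets = [0, 1]
block_probs = [[0.5, 0.5], [0.9, 0.1]]
block_targets = [1, 0]
masked = [True, False]           # only the first block slot was masked
loss = combined_loss(clean_probs, clean_targets, block_probs, block_targets, masked)
```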

Inference for BLT-D relies on carefully constructed attention masks to support block diffusion. For the decoder's cross-attention, clean positions attend to their corresponding latent tokens, while masked block positions attend to the last available latent token. For self-attention, the clean prefix uses a causal mask, whereas the corrupted block utilizes a fully bidirectional mask. These patterns are visualized in the mask diagrams below.
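
The mask patterns described above can be written out as boolean matrices. The following is a sketch under stated assumptions: the function names are hypothetical, `patch_ids` (the latent-token index for each clean byte) is taken as given, and block positions are assumed to attend to the entire clean prefix as well as to one another.

```python
def self_attention_mask(prefix_len, block_len):
    """mask[q][k] is True when query position q may attend to key position k."""
    n = prefix_len + block_len
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < prefix_len:
                mask[q][k] = k <= q   # causal inside the clean prefix
            else:
                mask[q][k] = True     # block sees the prefix and the whole block
    return mask


def cross_attention_mask(patch_ids, block_len):
    """Rows are decoder positions, columns are latent tokens.

    patch_ids[i] is the latent-token index assumed to correspond to clean byte i.
    """
    num_latents = max(patch_ids) + 1
    rows = [[j == pid for j in range(num_latents)] for pid in patch_ids]
    last = num_latents - 1
    for _ in range(block_len):
        # Every masked block position attends to the last available latent token.
        rows.append([j == last for j in range(num_latents)])
    return rows


self_mask = self_attention_mask(prefix_len=3, block_len=2)
cross_mask = cross_attention_mask(patch_ids=[0, 0, 1], block_len=2)
```

In a real model these boolean matrices would typically be converted to additive masks (zero where attention is allowed, a large negative value elsewhere) before the softmax.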

Finally, the authors propose two extensions inspired by speculative decoding, BLT-S and BLT-DV, which combine fast drafting with verification to recover generation quality. In this paradigm, the model drafts a sequence of bytes using a fast mechanism, such as diffusion or extended autoregressive decoding, and then verifies the draft using a slower, more accurate pass. The drafting stage proposes candidate bytes, and the verification stage accepts or rejects them based on the model's predictions. This iterative process is depicted in the figure below.
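
A generic draft-then-verify round can be sketched as follows. This is an illustration of the general paradigm with a greedy acceptance rule (accept the longest drafted prefix that matches the verifier's own predictions), which is an assumption here and may differ from the acceptance criterion the authors use. `speculative_step`, `toy_draft`, and `toy_verify` are hypothetical names, and the toy functions stand in for real model calls.

```python
def speculative_step(prefix, draft_fn, verify_fn, num_draft):
    """Run one draft-then-verify round and return the extended prefix.

    draft_fn(prefix, n) returns up to n candidate bytes from the fast mechanism.
    verify_fn(prefix, draft) returns the verifier's greedy choices, where entry i
    is its prediction after prefix + draft[:i] (one extra entry when available).
    """
    draft = draft_fn(prefix, num_draft)
    checked = verify_fn(prefix, draft)
    accepted = []
    for candidate, expected in zip(draft, checked):
        if candidate != expected:
            break
        accepted.append(candidate)
    # The verifier's prediction at the first mismatch (or after a fully accepted
    # draft) is already computed, so it is appended unless the sequence ended.
    if len(accepted) < len(checked):
        accepted.append(checked[len(accepted)])
    return prefix + accepted


TARGET = list(b"hello world")    # what the toy verifier would generate on its own


def toy_verify(prefix, draft):
    start = len(prefix)
    return TARGET[start:start + len(draft) + 1]


def toy_draft(prefix, n):
    start = len(prefix)
    guess = TARGET[start:start + n]
    if len(guess) > 2:
        guess[2] = ord("?")      # inject an error at the third drafted byte
    return guess


out, rounds = [], 0
while len(out) < len(TARGET):
    out = speculative_step(out, toy_draft, toy_verify, num_draft=4)
    rounds += 1
```

With this acceptance rule the final output is identical to what the verifier would produce byte by byte, while the number of verification rounds (4 in this toy run) is smaller than the number of generated bytes (11).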

Experiment

The experiments evaluate byte-level language models across translation and code generation tasks using 1B and 3B parameter scales to assess the trade-offs between inference speed and generation quality. Results indicate that the BLT-D framework significantly improves efficiency by reducing memory bandwidth and network function evaluations compared to standard autoregressive baselines, though larger block sizes may slightly degrade coding performance while maintaining translation accuracy. Additional evaluations confirm that these diffusion-based methods preserve autoregressive capabilities on reasoning benchmarks and allow for adjustable control over the balance between output diversity and computational cost.

The table presents likelihood-based evaluation results for 1B-parameter models across five standard language understanding and reasoning benchmarks. The baseline model achieves the highest scores on all five benchmarks, though the diffusion-based variants remain close behind. Increasing the diffusion block size generally leads to a slight reduction in scores across the datasets. This indicates that the diffusion training preserves strong autoregressive capabilities at the cost of a minor drop in accuracy.

The authors compare the efficiency and quality of standard autoregressive generation against diffusion-based and speculative inference methods. The data shows that diffusion models significantly reduce memory bandwidth and network function evaluations compared to the baseline, with efficiency increasing as the diffusion block size grows. However, this speed improvement correlates with a drop in task performance, which can be partially recovered by adding a verification step. Speculative inference achieves significant memory savings without compromising task performance relative to the baseline. Larger diffusion block sizes yield greater reductions in computational cost but result in lower generation quality. Verification mechanisms improve the accuracy of diffusion models but require additional computational resources compared to diffusion-only settings.

The table compares the baseline BLT 3B model with diffusion-based variants and speculation extensions, highlighting the trade-offs between inference efficiency and generation quality. Diffusion-only BLT-D models significantly reduce memory bandwidth and network function evaluations (NFEs), and the reductions grow with the diffusion block size, but they generally achieve lower BLEU scores than the autoregressive baseline. The verification-based BLT-DV variant recovers much of this performance loss while still offering significant reductions in memory bandwidth compared to the baseline. The self-speculation BLT-S method maintains the baseline model's task performance while substantially reducing memory bandwidth and global network function evaluations.

The authors evaluate inference extensions including BLT-S, BLT-D, and BLT-DV against a baseline autoregressive model to characterize speed-quality trade-offs. Results indicate that diffusion-based methods substantially reduce memory bandwidth and network function evaluations, though often at the cost of task performance metrics. Verification-based approaches help recover some of this performance loss while retaining significant efficiency improvements over the standard baseline. Diffusion-only models achieve the highest efficiency gains, particularly as block sizes increase, leading to the lowest memory bandwidth usage. Adding verification to diffusion models improves generation quality scores compared to diffusion-only variants, though it increases global network function evaluations. The self-speculation method maintains task performance levels similar to the baseline while still delivering notable reductions in memory bandwidth requirements.

The authors evaluate various inference strategies for language models, comparing diffusion-based and speculative methods against a standard autoregressive baseline. Results show that diffusion-based approaches significantly reduce memory bandwidth and computational steps, with efficiency gains increasing as block sizes grow. Verification-based extensions help recover task performance lost in pure diffusion settings while preserving most of the efficiency benefits. Diffusion-based models consistently achieve lower memory bandwidth usage and fewer network function evaluations than the autoregressive baseline. Adding verification to diffusion generation improves task quality but requires more computational resources than diffusion-only methods. Speculative generation maintains performance comparable to the baseline while offering substantial reductions in memory bandwidth requirements.

The experiments evaluate standard autoregressive baselines against diffusion-based and speculative inference methods across language understanding and generation tasks. Although the baseline consistently achieves the highest accuracy, diffusion variants significantly reduce memory bandwidth and computational evaluations, with larger block sizes yielding greater efficiency but lower quality. Verification mechanisms help recover performance losses in diffusion models, whereas speculative methods maintain baseline performance levels while offering substantial efficiency gains, ultimately highlighting a trade-off between inference speed and generation quality that can be balanced through architectural extensions.

