LLMCache: Layer-wise Caching Strategies for Accelerated Reuse in Transformer Inference
Harsh Vardhan Bansal
Abstract
Transformer-based language models have achieved remarkable performance across a wide range of tasks, yet their high inference latency remains a significant challenge for real-time and large-scale applications. Although existing caching mechanisms, such as token-level key-value caches, provide speedups during autoregressive decoding, they are limited in scope and generality. In this paper we introduce LLMCache, a novel layer-wise caching framework that accelerates transformer inference by reusing intermediate activations based on the semantic similarity of input sequences. Unlike prior approaches, LLMCache is model-agnostic, works with both encoder and decoder architectures, and enables caching at arbitrary transformer layers. We introduce a lightweight fingerprinting method for detecting semantically similar inputs and propose adaptive eviction strategies to manage cache staleness. Experiments with BERT and GPT-2 on the SQuAD, WikiText-103, and OpenBookQA datasets show up to 3.1× faster inference with less than 0.5% accuracy degradation. Our results highlight LLMCache as a practical and broadly applicable solution for optimizing transformer inference in real-world applications.
One-sentence Summary
Harsh Vardhan Bansal from Amazon Web Services proposes LLMCache, a model-agnostic layer-wise caching framework that accelerates transformer inference by reusing intermediate activations through semantic similarity matching; unlike token-level key-value caches, it operates across both encoder and decoder architectures with adaptive eviction strategies, achieving up to 3.1× speedup with minimal accuracy degradation in real-world applications like chat systems and document processing pipelines.
Key Contributions
- Transformer inference faces high latency due to redundant computations on semantically similar inputs, as existing token-level caches like key-value caching are restricted to decoder-only architectures and cannot exploit cross-input activation reuse in encoder or encoder-decoder models.
- LLMCache introduces a model-agnostic layer-wise caching framework that reuses intermediate activations by matching input fingerprints based on semantic similarity, operating across arbitrary transformer layers with adaptive eviction strategies to manage cache staleness without model retraining.
- Experiments on BERT and GPT-2 across SQuAD, WikiText-103, and OpenBookQA demonstrate up to 3.1× inference speedup with less than 0.5% accuracy degradation, validating its effectiveness for real-world applications like conversational agents and document pipelines.
Introduction
Transformer inference latency remains a critical barrier for real-time deployment of large language models, especially in applications like conversational AI and document processing where inputs often share semantic or structural similarities. Prior optimization techniques such as quantization, pruning, or key-value caching suffer from key limitations: quantization and pruning require retraining or sacrifice accuracy, while standard key-value caching only accelerates autoregressive decoding in decoder-only models and cannot reuse intermediate activations across encoder or encoder-decoder architectures. The authors leverage layer-wise caching of intermediate activations to address this gap, introducing LLMCache—a model-agnostic framework that fingerprints input semantics to identify and reuse stable representations across arbitrary transformer layers. Their approach supports both encoder and decoder models, uses adaptive eviction to manage cache staleness, and achieves up to 3.1× inference speedups with minimal accuracy loss on tasks like question answering and language modeling.
Method
The authors leverage a modular, layer-wise caching framework to accelerate transformer inference by reusing intermediate activations across semantically similar inputs. The system operates without modifying the underlying model architecture and is compatible with both encoder and decoder models. At its core, LLMCache introduces a semantic fingerprinting mechanism that enables adaptive matching, allowing reuse even under partial input drift.
The system architecture comprises five key components: an Input Fingerprint Generator, Layer-wise Cache Banks, a Cache Matching and Lookup Engine, a Layer Execution Manager, and a Cache Refresh and Replacement Controller. Refer to the framework diagram for a high-level view of how these components interact across transformer layers.

The inference workflow begins with the Input Fingerprint Generator, which computes a fixed-length semantic fingerprint f_X for the input sequence X = {x_1, x_2, …, x_n}. This fingerprint is derived from aggregated token embeddings, optionally augmented with attention statistics, and compressed via SimHash or PCA to ensure efficient comparison. Fingerprints serve as keys for cache lookup and are compared using cosine similarity or the Jaccard index, depending on the hashing scheme.
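A minimal sketch of how such a fingerprint generator could look, assuming mean-pooled token embeddings compressed with random-hyperplane SimHash and compared bitwise; the class and function names below are illustrative, not the authors' implementation:

```python
import numpy as np

class FingerprintGenerator:
    """Hypothetical Input Fingerprint Generator: pools token embeddings and
    compresses them into a SimHash-style bit vector."""
    def __init__(self, embed_dim: int, num_bits: int = 128, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Random hyperplanes; the sign of each projection yields one fingerprint bit.
        self.planes = rng.standard_normal((num_bits, embed_dim))

    def __call__(self, token_embeddings: np.ndarray) -> np.ndarray:
        # token_embeddings: (seq_len, embed_dim) -> aggregate to a single vector.
        pooled = token_embeddings.mean(axis=0)
        return (self.planes @ pooled > 0).astype(np.uint8)

def bit_similarity(f1: np.ndarray, f2: np.ndarray) -> float:
    # Fraction of matching bits, a stand-in for the paper's cosine/Jaccard options.
    return float((f1 == f2).mean())
```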
Each transformer layer l maintains an independent cache bank C_l storing tuples (f, h_l), where f is a fingerprint and h_l is the corresponding hidden-state output. During inference, the Cache Matching and Lookup Engine checks C_l for a fingerprint f′ such that sim(f_X, f′) ≥ τ, where τ is a tunable similarity threshold. If a match is found, the cached activation h_l is reused; otherwise, the layer is computed normally and the result is stored in the cache.
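The per-layer cache bank can be pictured as a small store of (fingerprint, activation) pairs with threshold-based lookup; the sketch below assumes the bit-vector fingerprints from above and uses hypothetical names:

```python
import numpy as np

class LayerCacheBank:
    """Hypothetical cache bank C_l for a single transformer layer."""
    def __init__(self, tau: float = 0.85, max_entries: int = 512):
        self.tau = tau                  # similarity threshold
        self.max_entries = max_entries  # capacity before eviction kicks in
        self.entries = []               # list of (fingerprint, hidden_state) tuples

    def lookup(self, fingerprint: np.ndarray):
        # Return the cached activation of the best match with sim >= tau, else None.
        best_sim, best_hidden = 0.0, None
        for f, hidden in self.entries:
            sim = float((f == fingerprint).mean())
            if sim >= self.tau and sim > best_sim:
                best_sim, best_hidden = sim, hidden
        return best_hidden

    def insert(self, fingerprint: np.ndarray, hidden_state):
        if len(self.entries) >= self.max_entries:
            self.entries.pop(0)  # FIFO placeholder; see the eviction sketch further below
        self.entries.append((fingerprint, hidden_state))
```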
The Layer Execution Manager acts as a dynamic decision gate, seamlessly integrating with the transformer’s forward pass by selecting between cached reuse and full computation at each layer. This is implemented via PyTorch module hooks or subclass overrides, preserving compatibility with existing model implementations.
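One way to picture this decision gate is a thin wrapper around each transformer layer; the sketch below assumes the LayerCacheBank above and a fingerprint set once per request, and is a schematic stand-in rather than the authors' exact hook-based code:

```python
import torch
import torch.nn as nn

class CachedLayer(nn.Module):
    """Hypothetical Layer Execution Manager: reuse a cached activation if the
    bank has a match, otherwise run the wrapped layer and store the result."""
    def __init__(self, layer: nn.Module, cache: "LayerCacheBank"):
        super().__init__()
        self.layer = layer
        self.cache = cache
        self.fingerprint = None  # set per request before the forward pass

    def forward(self, hidden_states: torch.Tensor, *args, **kwargs):
        cached = self.cache.lookup(self.fingerprint)
        if cached is not None:
            return cached  # skip computation and reuse the stored activation
        out = self.layer(hidden_states, *args, **kwargs)
        self.cache.insert(self.fingerprint, out)
        return out
```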
As shown in the figure below, the inference flow proceeds layer by layer, with each layer independently deciding whether to reuse or recompute based on cache lookup results. This layer-wise granularity avoids the overhead of token-level key-value caching and enables fine-grained control over reuse behavior.

To maintain cache efficiency and prevent memory bloat, the Cache Refresh and Replacement Controller employs eviction policies such as Least Recently Used (LRU), staleness-aware decay, and divergence monitoring. Divergence is measured by tracking output drift for a given fingerprint across inference calls, triggering revalidation when performance degrades. Temporal decay factors further ensure that outdated entries are flushed over time.
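The controller logic could combine an LRU order, a temporal decay horizon, and a divergence check along the lines of the sketch below; the thresholds, field names, and drift metric are assumptions, not the paper's specification:

```python
import time
import torch
from collections import OrderedDict

class CacheController:
    """Hypothetical refresh/replacement controller for one cache bank."""
    def __init__(self, max_entries=512, max_age_s=300.0, max_divergence=0.05):
        self.max_entries = max_entries
        self.max_age_s = max_age_s            # temporal decay horizon
        self.max_divergence = max_divergence  # tolerated output drift
        self.entries = OrderedDict()          # key (e.g. fingerprint bytes) -> (hidden, timestamp)

    def get(self, key):
        if key not in self.entries:
            return None
        hidden, ts = self.entries[key]
        if time.time() - ts > self.max_age_s:
            del self.entries[key]             # stale entry: flush and recompute
            return None
        self.entries.move_to_end(key)         # refresh LRU position
        return hidden

    def put(self, key, hidden: torch.Tensor):
        if len(self.entries) >= self.max_entries:
            self.entries.popitem(last=False)  # evict the least recently used entry
        self.entries[key] = (hidden, time.time())

    def revalidate(self, key, fresh_hidden: torch.Tensor):
        # Divergence monitoring: drop the entry if the recomputed output drifts too far.
        cached, _ = self.entries.get(key, (None, None))
        if cached is None:
            return
        drift = float((cached - fresh_hidden).abs().mean() / (cached.abs().mean() + 1e-8))
        if drift > self.max_divergence:
            del self.entries[key]
```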
The overall inference process is formalized as:
$$
h_l =
\begin{cases}
C_l[f_X] & \text{if } \mathrm{sim}(f_X, f') \ge \tau \\
f_l(h_{l-1}) & \text{otherwise}
\end{cases}
$$

where C_l is the cache for layer l, f′ ranges over the existing fingerprint keys, and τ governs the trade-off between reuse rate and semantic fidelity. The system allows tuning of τ, cache size, and layer selection to balance speed and accuracy per application.
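Tying the pieces together, the per-layer rule above amounts to a simple reuse-or-compute loop; the schematic below reuses the hypothetical LayerCacheBank and mirrors the equation's variable names:

```python
def cached_forward(layers, cache_banks, hidden, fingerprint):
    """Schematic forward pass: layers are the per-layer functions f_l,
    cache_banks holds one LayerCacheBank per layer."""
    for layer, bank in zip(layers, cache_banks):
        reused = bank.lookup(fingerprint)   # any f' with sim(f_X, f') >= tau?
        if reused is not None:
            hidden = reused                 # h_l = C_l[f_X]
        else:
            hidden = layer(hidden)          # h_l = f_l(h_{l-1})
            bank.insert(fingerprint, hidden)
    return hidden
```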
Experiment
- Achieved 2.4× latency reduction on BERT-base and up to 3.1× overall versus NoCache across WikiText-103, SQuAD v2, and OpenBookQA, outperforming KV-Cache on GPT-2
- Recorded up to 92% cache hit rates in lower/mid transformer layers on GPT-2 WikiText-103, with upper layers showing higher sensitivity to semantic variation
- Maintained task accuracy within 0.5% drop across all benchmarks, demonstrating superior robustness versus DocCache due to finer-grained layer control
- Enabled flexible memory-hit rate tradeoffs with logarithmic overhead growth, bounded by efficient fingerprinting in BERT-base experiments
- Identified optimal similarity threshold (τ) between 0.82 and 0.88 via ablation, balancing reuse frequency and output fidelity
The authors use LLMCache to accelerate transformer inference by reusing intermediate representations, achieving significant latency reductions across BERT-base, DistilBERT, and GPT-2-small. Results show LLMCache cuts inference time by up to 2.4× on BERT-base compared to no caching and outperforms KV-Cache on GPT-2, indicating that finer-grained reuse yields greater speedups. All models maintain high efficiency with minimal accuracy loss, validating the method's practicality for real-time applications.

Results show that LLMCache maintains task accuracy within 0.5% of the baseline across all evaluated datasets, outperforming DocCache in preserving fidelity while enabling inference acceleration. The authors use this to validate the semantic stability of their layer-wise caching approach.
