HyperAIHyperAI

Command Palette

Search for a command to run...

LLMCache: Transformer 추론에서 재사용을 가속화하기 위한 계층별 캐싱 전략

Harsh Vardhan Bansal

초록

기반 기반 언어 모델은 다양한 작업에서 뛰어난 성능을 달성했지만, 높은 추론 지연 시간은 실시간 및 대규모 배포에 있어 중요한 도전 과제를 안고 있다. 기존의 캐싱 기법, 예를 들어 토큰 수준의 키-밸류 캐시는 순차적 디코딩에서 속도 향상을 제공하지만, 적용 범위와 활용 가능성 측면에서 한계가 있다. 본 논문에서는 입력 시퀀스의 의미적 유사성을 기반으로 중간 활성화 값을 재사용함으로써 트랜스포머 추론을 가속화하는 새로운 계층별 캐싱 프레임워크인 LLMCache를 제안한다. 기존 연구들과 달리, LLMCache는 모델에 종속되지 않으며 인코더 및 디코더 아키텍처 모두에서 작동하며, 임의의 트랜스포머 계층에서 캐싱을 지원한다. 의미적으로 유사한 입력을 매칭하기 위한 경량화된 지문(지문) 기반 매칭 메커니즘을 도입하고, 캐시 노화를 관리하기 위한 적응형 제거 전략을 제안한다. SQuAD, WikiText-103, OpenBookQA에서 BERT와 GPT-2에 대한 실험 결과, 정확도 저하가 0.5% 미만인 범위 내에서 추론 시간이 최대 3.1배까지 가속화됨을 확인하였다. 본 연구 결과는 LLMCache가 실제 응용에서 트랜스포머 추론을 최적화하기 위한 실용적이고 일반적인 해결책임을 강조한다.

One-sentence Summary

Harsh Vardhan Bansal from Amazon Web Services proposes LLMCache, a model-agnostic layer-wise caching framework that accelerates transformer inference by reusing intermediate activations through semantic similarity matching. Unlike token-level key-value caches, it operates across both encoder and decoder architectures with adaptive eviction strategies, achieving up to 3.1× speedup with minimal accuracy degradation in real-world applications like chat systems and document processing pipelines.

Key Contributions

  • Transformer inference faces high latency due to redundant computations on semantically similar inputs, as existing token-level caches like key-value caching are restricted to decoder-only architectures and cannot exploit cross-input activation reuse in encoder or encoder-decoder models.
  • LLMCache introduces a model-agnostic layer-wise caching framework that reuses intermediate activations by matching input fingerprints based on semantic similarity, operating across arbitrary transformer layers with adaptive eviction strategies to manage cache staleness without model retraining.
  • Experiments on BERT and GPT-2 across SQuAD, WikiText-103, and OpenBookQA demonstrate up to 3.1× inference speedup with less than 0.5% accuracy degradation, validating its effectiveness for real-world applications like conversational agents and document pipelines.

Introduction

Transformer inference latency remains a critical barrier for real-time deployment of large language models, especially in applications like conversational AI and document processing where inputs often share semantic or structural similarities. Prior optimization techniques such as quantization, pruning, or key-value caching suffer from key limitations: quantization and pruning require retraining or sacrifice accuracy, while standard key-value caching only accelerates autoregressive decoding in decoder-only models and cannot reuse intermediate activations across encoder or encoder-decoder architectures. The authors leverage layer-wise caching of intermediate activations to address this gap, introducing LLMCache—a model-agnostic framework that fingerprints input semantics to identify and reuse stable representations across arbitrary transformer layers. Their approach supports both encoder and decoder models, uses adaptive eviction to manage cache staleness, and achieves up to 3.1× inference speedups with minimal accuracy loss on tasks like question answering and language modeling.

Method

The authors leverage a modular, layer-wise caching framework to accelerate transformer inference by reusing intermediate activations across semantically similar inputs. The system operates without modifying the underlying model architecture and is compatible with both encoder and decoder models. At its core, LLMCache introduces a semantic fingerprinting mechanism that enables adaptive matching, allowing reuse even under partial input drift.

The system architecture comprises five key components: an Input Fingerprint Generator, Layer-wise Cache Banks, a Cache Matching and Lookup Engine, a Layer Execution Manager, and a Cache Refresh and Replacement Controller. Refer to the framework diagram for a high-level view of how these components interact across transformer layers.

The inference workflow begins with the Input Fingerprint Generator, which computes a fixed-length semantic fingerprint fXf_XfX for the input sequence X={x1,x2,,xn}X = \{x_1, x_2, \ldots, x_n\}X={x1,x2,,xn}. This fingerprint is derived from aggregated token embeddings, optionally augmented with attention statistics, and compressed via SimHash or PCA to ensure efficient comparison. Fingerprints serve as keys for cache lookup and are compared using cosine similarity or Jaccard index, depending on the hashing scheme.

Each transformer layer lll maintains an independent cache bank Cl\mathcal{C}_lCl storing tuples (f,hl)(f, h_l)(f,hl), where fff is a fingerprint and hlh_lhl is the corresponding hidden state output. During inference, the Cache Matching and Lookup Engine checks Cl\mathcal{C}_lCl for a fingerprint ff'f such that sim(fX,f)τ\text{sim}(f_X, f') \geq \tausim(fX,f)τ, where τ\tauτ is a tunable similarity threshold. If a match is found, the cached activation hlh_lhl is reused; otherwise, the layer is computed normally and the result is stored in the cache.

The Layer Execution Manager acts as a dynamic decision gate, seamlessly integrating with the transformer’s forward pass by selecting between cached reuse and full computation at each layer. This is implemented via PyTorch module hooks or subclass overrides, preserving compatibility with existing model implementations.

As shown in the figure below, the inference flow proceeds layer by layer, with each layer independently deciding whether to reuse or recompute based on cache lookup results. This layer-wise granularity avoids the overhead of token-level key-value caching and enables fine-grained control over reuse behavior.

To maintain cache efficiency and prevent memory bloat, the Cache Refresh and Replacement Controller employs eviction policies such as Least Recently Used (LRU), staleness-aware decay, and divergence monitoring. Divergence is measured by tracking output drift for a given fingerprint across inference calls, triggering revalidation when performance degrades. Temporal decay factors further ensure that outdated entries are flushed over time.

The overall inference process is formalized as:

hl={Cl[fX]if sim(fX,f)>τfl(hl1)otherwiseh_{l} = \begin{cases} C_{l}[f_{X}] & \text{if } \operatorname{sim}(f_{X}, f') > \tau \\ f_{l}(h_{l-1}) & \text{otherwise} \end{cases}hl={Cl[fX]fl(hl1)if sim(fX,f)>τotherwise

where ClC_lCl is the cache for layer lll, ff'f are existing fingerprint keys, and τ\tauτ governs the trade-off between reuse rate and semantic fidelity. The system allows tuning of τ\tauτ, cache size, and layer selection to balance speed and accuracy per application.

Experiment

  • Achieved 2.4× latency reduction on BERT-base and up to 3.1× overall versus NoCache across WikiText-103, SQuAD v2, and OpenBookQA, outperforming KV-Cache on GPT-2
  • Recorded up to 92% cache hit rates in lower/mid transformer layers on GPT-2 WikiText-103, with upper layers showing higher sensitivity to semantic variation
  • Maintained task accuracy within 0.5% drop across all benchmarks, demonstrating superior robustness versus DocCache due to finer-grained layer control
  • Enabled flexible memory-hit rate tradeoffs with logarithmic overhead growth, bounded by efficient fingerprinting in BERT-base experiments
  • Identified optimal similarity threshold (τ) between 0.82 and 0.88 via ablation, balancing reuse frequency and output fidelity

The authors use LLMCache to accelerate transformer inference by reusing intermediate representations, achieving significant latency reductions across BERT-base, DistilBERT, and GPT-2-small. Results show LLMCache cuts inference time by up to 2.4× compared to no caching and outperforms KV-Cache on GPT-2, indicating finer-grained reuse yields greater speedups. All models maintain high efficiency with minimal accuracy loss, validating the method’s practicality for real-time applications.

Results show that LLMCache maintains task accuracy within 0.5% of the baseline across all evaluated datasets, outperforming DocCache in preserving fidelity while enabling inference acceleration. The authors use this to validate the semantic stability of their layer-wise caching approach.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp
LLMCache: Transformer 추론에서 재사용을 가속화하기 위한 계층별 캐싱 전략 | 문서 | HyperAI초신경