Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache

Large Language Models struggle with the memory demands of the growing Key-Value (KV) cache as context lengths increase. Existing compression methods homogenize head dimensions or rely on attention-guided token pruning, often sacrificing accuracy or introducing computational overhead. We propose FourierAttention, a training-free framework that exploits the heterogeneous roles of transformer head dimensions: lower dimensions prioritize local context, while upper ones capture long-range dependencies. By projecting the long-context-insensitive dimensions onto orthogonal Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. Evaluations on LLaMA models show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack (NIAH). In addition, a custom Triton kernel, FlashFourierAttention, optimizes memory via streamlined read-write operations, enabling efficient deployment without performance compromise.
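
To make the core idea concrete, the sketch below illustrates, under assumptions of our own (it is not the authors' implementation), how a slice of cached KV activations can be projected onto an orthogonal Fourier-type basis (a DCT basis here) so that its evolution over the sequence is stored as a fixed number of spectral coefficients, independent of context length. The function names (`dct_basis`, `compress`, `reconstruct`) and the choice of 128 coefficients are illustrative only.

```python
import torch

def dct_basis(seq_len: int, num_coeffs: int) -> torch.Tensor:
    """Orthonormal DCT-II basis of shape (seq_len, num_coeffs)."""
    n = torch.arange(seq_len).unsqueeze(1).float()      # token positions
    k = torch.arange(num_coeffs).unsqueeze(0).float()   # frequency indices
    basis = torch.cos(torch.pi / seq_len * (n + 0.5) * k)
    scale = torch.full((num_coeffs,), (2.0 / seq_len) ** 0.5)
    scale[0] = (1.0 / seq_len) ** 0.5
    return basis * scale                                 # columns are orthonormal

def compress(kv_slice: torch.Tensor, num_coeffs: int) -> torch.Tensor:
    """Project (seq_len, dim) KV activations onto num_coeffs spectral coefficients."""
    basis = dct_basis(kv_slice.shape[0], num_coeffs)
    return basis.T @ kv_slice                            # (num_coeffs, dim): fixed-size storage

def reconstruct(coeffs: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Approximate the original (seq_len, dim) activations from the coefficients."""
    basis = dct_basis(seq_len, coeffs.shape[0])
    return basis @ coeffs

# Example: 4096 cached tokens, 64 long-context-insensitive channels,
# stored as 128 coefficients regardless of sequence length (hypothetical sizes).
kv = torch.randn(4096, 64)
coeffs = compress(kv, num_coeffs=128)
approx = reconstruct(coeffs, seq_len=4096)
```

In such a scheme, only the channels deemed insensitive to long-range content would be compressed this way, while the remaining channels keep their exact cached keys and values; the actual selection criterion and kernel-level layout are what the paper's FourierAttention and FlashFourierAttention contribute.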