Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache

Large Language Models struggle with the memory demands of the growing Key-Value (KV) cache as context lengths increase. Existing compression methods homogenize head dimensions or rely on attention-guided token pruning, often sacrificing accuracy or introducing computational overhead. We propose FourierAttention, a training-free framework that exploits the heterogeneous roles of transformer head dimensions: lower dimensions prioritize local context, while upper ones capture long-range dependencies. By projecting the long-context-insensitive dimensions onto orthogonal Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. Evaluations on LLaMA models show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack (NIAH). In addition, a custom Triton kernel, FlashFourierAttention, optimizes memory via streamlined read-write operations, enabling efficient deployment without performance compromise.
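
To make the core idea concrete, the sketch below illustrates, under assumptions of our own (it is not the authors' implementation), how a slice of cached KV activations can be projected onto an orthogonal Fourier-type basis (a DCT basis here) so that its evolution over the sequence is stored as a fixed number of spectral coefficients, independent of context length. The function names (`dct_basis`, `compress`, `reconstruct`) and the choice of 128 coefficients are illustrative only.

```python
import torch

def dct_basis(seq_len: int, num_coeffs: int) -> torch.Tensor:
    """Orthonormal DCT-II basis of shape (seq_len, num_coeffs)."""
    n = torch.arange(seq_len).unsqueeze(1).float()      # token positions
    k = torch.arange(num_coeffs).unsqueeze(0).float()   # frequency indices
    basis = torch.cos(torch.pi / seq_len * (n + 0.5) * k)
    scale = torch.full((num_coeffs,), (2.0 / seq_len) ** 0.5)
    scale[0] = (1.0 / seq_len) ** 0.5
    return basis * scale                                 # columns are orthonormal

def compress(kv_slice: torch.Tensor, num_coeffs: int) -> torch.Tensor:
    """Project (seq_len, dim) KV activations onto num_coeffs spectral coefficients."""
    basis = dct_basis(kv_slice.shape[0], num_coeffs)
    return basis.T @ kv_slice                            # (num_coeffs, dim): fixed-size storage

def reconstruct(coeffs: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Approximate the original (seq_len, dim) activations from the coefficients."""
    basis = dct_basis(seq_len, coeffs.shape[0])
    return basis @ coeffs

# Example: 4096 cached tokens, 64 long-context-insensitive channels,
# stored as 128 coefficients regardless of sequence length (hypothetical sizes).
kv = torch.randn(4096, 64)
coeffs = compress(kv, num_coeffs=128)
approx = reconstruct(coeffs, seq_len=4096)
```

In such a scheme, only the channels deemed insensitive to long-range content would be compressed this way, while the remaining channels keep their exact cached keys and values; the actual selection criterion and kernel-level layout are what the paper's FourierAttention and FlashFourierAttention contribute.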