SparseMM, a Strategy for KV-Cache Optimization Using Visual Head Sparsity
SparseMM is a key-value (KV) cache optimization strategy that exploits the sparsity of visual heads in multimodal large language models (MLLMs). It was proposed on June 5, 2025 by the Intelligent Vision Laboratory of Tsinghua University and the Tencent Hunyuan X Group. SparseMM allocates an asymmetric computation budget to each attention head in the large language model according to that head's visual score. The associated paper is "SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs".
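To illustrate what score-based asymmetric budget allocation can look like, here is a minimal sketch in Python. It is not the authors' implementation: it assumes each attention head already has a scalar visual score, and the specific allocation rule (a small uniform floor per head plus a share of the remaining budget proportional to the score) as well as all function and parameter names are hypothetical.

```python
import torch

def allocate_kv_budget(visual_scores: torch.Tensor,
                       total_budget: int,
                       min_per_head: int = 4) -> torch.Tensor:
    """Split a total KV-cache budget across attention heads.

    Each head receives a small uniform floor plus a share of the
    remaining budget proportional to its visual score, so heads that
    respond strongly to visual tokens keep more cache entries.
    This is an illustrative sketch, not SparseMM's exact rule.
    """
    num_heads = visual_scores.numel()
    floor = min_per_head * torch.ones(num_heads, dtype=torch.long)
    remaining = total_budget - int(floor.sum())
    weights = visual_scores / visual_scores.sum()
    extra = torch.floor(weights * remaining).long()
    # Hand out any rounding leftover to the highest-scoring heads.
    leftover = remaining - int(extra.sum())
    if leftover > 0:
        top = torch.argsort(visual_scores, descending=True)[:leftover]
        extra[top] += 1
    return floor + extra

# Example: four heads, two of which respond strongly to visual tokens.
scores = torch.tensor([0.02, 0.40, 0.05, 0.53])
print(allocate_kv_budget(scores, total_budget=1024))
```

Under this kind of scheme, heads with near-zero visual scores keep only the minimal floor of cache entries, while the budget freed up is concentrated on the few visually responsive heads.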
Compared with previous methods, SparseMM prioritizes and preserves visual semantics during decoding. Extensive evaluations on mainstream multimodal benchmarks show that SparseMM achieves a better accuracy-efficiency trade-off: in efficiency tests it delivers a 1.38x real-time speedup and a 52% reduction in memory while maintaining comparable performance.