When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios

Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio inputs. While this ability significantly enhances MLLM capabilities, it introduces substantial computational challenges, primarily due to the quadratic complexity of self-attention with respect to the number of input tokens. To mitigate these bottlenecks, token compression has emerged as a promising and critical approach that efficiently reduces the number of tokens during both training and inference. In this paper, we present the first systematic survey and synthesis of the burgeoning field of multimodal long-context token compression. Recognizing that effective compression strategies are deeply tied to the unique characteristics and redundancies of each modality, we categorize existing approaches by their primary data focus, enabling researchers to quickly access and learn methods tailored to their specific area of interest: (1) image-centric compression, which addresses spatial redundancy in visual data; (2) video-centric compression, which tackles spatio-temporal redundancy in dynamic sequences; and (3) audio-centric compression, which handles temporal and spectral redundancy in acoustic signals. Beyond this modality-driven categorization, we further dissect methods by their underlying mechanisms, including transformation-based, similarity-based, attention-based, and query-based approaches. By providing a comprehensive and structured overview, this survey aims to consolidate current progress, identify key challenges, and inspire future research directions in this rapidly evolving domain. We also maintain a public repository to continuously track and update the latest advances in this promising area.
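
To make the similarity-based mechanism category concrete, the following is a minimal illustrative sketch, not a method from any specific surveyed paper: visual tokens whose embeddings are most cosine-similar are averaged together until a target count is reached, so the sequence handed to self-attention is shorter. The function name, greedy pairwise strategy, and parameter values are illustrative assumptions.

```python
# Minimal sketch of similarity-based token compression (illustrative only,
# not a reference implementation of any surveyed method).
import numpy as np

def merge_similar_tokens(tokens: np.ndarray, target_len: int) -> np.ndarray:
    """Greedily merge the most cosine-similar token pair until target_len remain.

    tokens: (N, D) array of token embeddings; returns a (target_len, D) array.
    """
    tokens = tokens.astype(np.float64).copy()
    counts = np.ones(len(tokens))  # how many original tokens each row represents
    while len(tokens) > target_len:
        normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
        sim = normed @ normed.T
        np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        # Count-weighted average keeps repeated merges unbiased.
        merged = (counts[i] * tokens[i] + counts[j] * tokens[j]) / (counts[i] + counts[j])
        tokens[i], counts[i] = merged, counts[i] + counts[j]
        tokens = np.delete(tokens, j, axis=0)
        counts = np.delete(counts, j)
    return tokens

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual_tokens = rng.normal(size=(576, 64))  # e.g., 24x24 image patch tokens, dim 64
    compressed = merge_similar_tokens(visual_tokens, target_len=144)
    print(compressed.shape)  # (144, 64): 4x fewer tokens passed to the LLM
```

Because self-attention cost grows quadratically with sequence length, reducing the token count by 4x in this toy example cuts the attention cost by roughly 16x, which is the efficiency argument the surveyed compression methods exploit.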