Star Attention: A Block-Sparse Attention Mechanism
Star Attention is a block-sparse attention mechanism proposed by NVIDIA in 2024 to improve the inference efficiency of Transformer-based large language models (LLMs) on long sequences. Through a two-phase processing flow, it significantly speeds up inference and reduces the use of computing resources while maintaining high accuracy.
The accompanying paper, "Star Attention: Efficient LLM Inference over Long Sequences", details the working principle and advantages of Star Attention. It operates in two phases: the first phase encodes the context, and the second phase processes the query and generates tokens. Star Attention reduces memory requirements and inference time by up to 11x while maintaining 95-100% accuracy.
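To make the two-phase flow concrete, here is a minimal single-head PyTorch sketch, not NVIDIA's reference implementation: the function names (`phase1_context_encoding`, `phase2_query`), the block size, and the concatenation of all cached blocks in phase 2 are illustrative assumptions. The use of the first context block as a shared "anchor" block follows the paper's description, but the real method also merges per-host softmax statistics rather than concatenating the full KV cache in one place.

```python
# Hedged sketch of Star Attention's two-phase flow (illustration only).
# Single head, no batching; shapes and block handling are simplified.
import torch
import torch.nn.functional as F

def attend(q, k, v):
    """Standard scaled dot-product attention over one (q, k, v) set."""
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale
    return F.softmax(scores, dim=-1) @ v

def phase1_context_encoding(ctx_q, ctx_k, ctx_v, block_size):
    """Phase 1: split the context into blocks; every block after the first
    attends only to itself plus the first "anchor" block, never to the full
    context. Returns the per-block KV cache that phase 2 will reuse."""
    anchor_k = ctx_k[:block_size]
    anchor_v = ctx_v[:block_size]
    kv_cache, outputs = [], []
    for start in range(0, ctx_q.shape[0], block_size):
        q = ctx_q[start:start + block_size]
        k = ctx_k[start:start + block_size]
        v = ctx_v[start:start + block_size]
        kv_cache.append((k, v))              # keep only the block's own KV
        if start == 0:
            outputs.append(attend(q, k, v))  # anchor block attends to itself
        else:                                # other blocks: anchor + local block
            outputs.append(attend(q,
                                  torch.cat([anchor_k, k]),
                                  torch.cat([anchor_v, v])))
    return torch.cat(outputs), kv_cache

def phase2_query(query_q, kv_cache):
    """Phase 2: the query attends globally to every cached context block
    (concatenated here for simplicity; the paper distributes this step)."""
    all_k = torch.cat([k for k, _ in kv_cache])
    all_v = torch.cat([v for _, v in kv_cache])
    return attend(query_q, all_k, all_v)

# Toy usage: 256 context tokens in blocks of 64, then a 4-token query.
d = 32
ctx_q, ctx_k, ctx_v = (torch.randn(256, d) for _ in range(3))
_, cache = phase1_context_encoding(ctx_q, ctx_k, ctx_v, block_size=64)
out = phase2_query(torch.randn(4, d), cache)
print(out.shape)  # torch.Size([4, 32])
```

Because each context block only ever sees itself and the anchor block, phase 1 scales roughly linearly with context length instead of quadratically, which is where the memory and latency savings come from; the query in phase 2 still gets a global view of the cached context.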