HyperAI


FlashMoBA: 14.7x Faster Long-Context LLMs Through GPU-Optimized Attention

Long-context processing has long been a bottleneck for large language models (LLMs): the standard self-attention mechanism scales quadratically with sequence length, so computation becomes prohibitively expensive as inputs grow. In February this year, Moonshot AI introduced Mixture of Block Attention (MoBA), a novel architecture inspired by mixture-of-experts (MoE) systems. MoBA divides long sequences into smaller blocks and uses a routing mechanism to dynamically activate only the most relevant ones, reducing computational complexity from quadratic to near-linear.

This breakthrough promised scalable long-context handling, but practical deployment faced two major hurdles: unclear design principles behind MoBA's success, and a lack of hardware-optimized implementations. Despite its theoretical advantages, naive implementations of MoBA suffer from high overhead when using small block sizes. The cost of managing thousands of tiny blocks can outweigh the benefits of sparsity, especially on modern GPUs, where memory-access patterns and compute efficiency are critical. This gap between theory and real-world performance left MoBA's full potential untapped.

To bridge this gap, researchers from MIT's HAN Lab, led by Professor Song Han, partnered with NVIDIA to introduce FlashMoBA, a hardware-aware, CUDA-optimized implementation of MoBA that unlocks its true performance. The work not only explains why MoBA works but also re-engineers it from the ground up for modern GPU architectures.

The key insight from the team's analysis lies in quantifying the routing process through a signal-to-noise ratio (SNR) model. They found that routing accuracy depends on the ratio between the attention head dimension (d) and the block size (B): smaller blocks improve routing precision, provided the model's capacity remains constant.
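MoBA's routing step can be sketched in a few lines: keys are pooled into one centroid per block, and each query attends only within its top-k highest-scoring blocks, so work scales with k rather than with sequence length. The NumPy toy below is an illustration of this idea only; the function name `moba_route`, the sizes, and the mean-pooled gating score are assumptions for exposition, not the paper's exact algorithm.

```python
import numpy as np

def moba_route(q, K, block_size, top_k):
    """Route one query to its top-k key blocks via mean-pooled centroids.

    Toy sketch of MoBA-style gating: keys are split into fixed-size
    blocks, each block is summarized by its mean (centroid), and only
    the top-k highest-scoring blocks are selected for attention.
    """
    n, d = K.shape
    n_blocks = n // block_size                 # assume n divisible by block_size
    centroids = K[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    scores = centroids @ q                     # one routing score per block
    chosen = np.argsort(scores)[-top_k:]       # indices of the top-k blocks
    return np.sort(chosen)

rng = np.random.default_rng(0)
d, B, n = 64, 16, 256                          # head dim, block size, seq length
K = rng.standard_normal((n, d))
q = rng.standard_normal(d)
blocks = moba_route(q, K, block_size=B, top_k=4)
print(blocks)                                  # 4 block indices out of 256 // 16 = 16
```

With block size B, the query scores only n/B centroids and then attends over k·B keys, which is the source of the near-linear complexity described above.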
The team further discovered that applying short convolutions within blocks helps cluster relevant information, amplifying the signal and enhancing performance. However, running many small blocks on GPUs is inefficient due to three major issues: frequent, non-coalesced memory accesses; high overhead from sorting and scoring thousands of blocks; and underutilized GPU compute due to fine task granularity. FlashMoBA addresses all of these with completely re-architected CUDA kernels designed for performance and memory efficiency. The solution centers on two core innovations:

- FlashTopK: a fully integrated, end-to-end pipeline that computes block centroids and selects the top-k relevant blocks in a single kernel, without ever materializing a large, memory-heavy score matrix. This eliminates the primary bottleneck of traditional top-k operations and prevents out-of-memory (OOM) errors.
- Gather-and-Densify: a two-stage strategy inspired by efficient data management. First, it gathers all necessary data from scattered blocks into the GPU's high-speed on-chip memory (SRAM). Then it reorganizes the sparse data into dense, contiguous matrices, exactly the format GPUs process most efficiently. This dramatically reduces HBM bandwidth usage and maximizes compute utilization.

The results are striking. On a 64K sequence, FlashMoBA achieves a 7.4× speedup over the original MoBA implementation and reduces memory usage by 6.1×. While the original MoBA fails at 128K due to memory limits, FlashMoBA handles sequences up to 512K, four times longer. Compared directly against FlashAttention-2, the current industry standard, FlashMoBA delivers a 14.7× speedup on long sequences. Crucially, this performance gain comes without sacrificing model quality: the team trained multiple models from scratch and found that smaller block sizes significantly improve performance on both language modeling and long-context retrieval tasks.
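The gather-and-densify idea can be illustrated at a high level without CUDA: gather the rows of the selected key/value blocks into one contiguous dense matrix, then run ordinary dense attention over it. The NumPy sketch below is a conceptual analogy of the memory layout, not the actual kernel; the function names, shapes, and the single-query formulation are all assumptions.

```python
import numpy as np

def gather_and_densify(K, V, block_ids, block_size):
    """Gather scattered key/value blocks into dense, contiguous matrices.

    Conceptual analogue of FlashMoBA's two-stage strategy: collect the
    rows belonging to the selected blocks, then pack them into one dense
    matrix so the subsequent attention maps onto efficient dense math.
    """
    rows = np.concatenate([np.arange(b * block_size, (b + 1) * block_size)
                           for b in block_ids])
    return K[rows], V[rows]        # shape: (len(block_ids) * block_size, d)

def sparse_attention(q, K, V, block_ids, block_size):
    """Dense softmax attention restricted to the gathered blocks."""
    Kd, Vd = gather_and_densify(K, V, block_ids, block_size)
    logits = Kd @ q / np.sqrt(K.shape[1])
    w = np.exp(logits - logits.max())          # numerically stable softmax
    w /= w.sum()
    return w @ Vd                              # output over selected blocks only

rng = np.random.default_rng(1)
n, d, B = 128, 32, 16
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
q = rng.standard_normal(d)
out = sparse_attention(q, K, V, block_ids=[0, 3, 5], block_size=B)
print(out.shape)                               # (32,): one output vector
```

The point of the densification step is that the inner products run over one contiguous buffer rather than scattered gathers, which is the property the real kernel exploits to keep HBM traffic coalesced.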
The improvement stems from mitigating "attention dilution," a common problem in long sequences where standard attention spreads focus too thinly. By routing computation only to the most relevant blocks, MoBA maintains sharp focus, leading to better accuracy, especially in complex, long-form reasoning.

The research demonstrates that theoretical advances in model architecture must be paired with deep hardware-aware optimization to realize real-world impact. FlashMoBA is not just an incremental improvement: it is a paradigm shift in how we think about efficient, scalable attention for long-context AI.

For more details, see:
1. Paper: https://arxiv.org/pdf/2511.11571
2. Project code: https://github.com/mit-han-lab/flash-moba
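The attention-dilution effect mentioned above is easy to reproduce numerically: with random queries and keys, the largest softmax weight a query assigns to any single token shrinks as the number of attended tokens grows. A minimal NumPy illustration (the sizes and seed are arbitrary choices, not values from the paper):

```python
import numpy as np

def max_attention_weight(n, d=64, seed=0):
    """Largest softmax attention weight a random query assigns over n random keys."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(d)
    K = rng.standard_normal((n, d))
    logits = K @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())          # numerically stable softmax
    return (w / w.sum()).max()

# The peak weight decays as the context grows: attention gets "diluted".
for n in (256, 4096, 65536):
    print(n, float(max_attention_weight(n)))
```

Restricting attention to a few relevant blocks keeps the effective n small, which is why block routing preserves sharp focus on long inputs.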
