FlashAttention Algorithm

FlashAttention is an efficient, memory-aware exact attention algorithm proposed in 2022 by researchers from Stanford University and the State University of New York. It aims to solve the high computational complexity and memory usage of the self-attention layer in the standard Transformer model. The work was published in the paper "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". The algorithm has been integrated into PyTorch 2.0 and implemented by several open source frameworks such as Triton and xFormers. By reordering the attention computation and using tiling and recomputation, it significantly speeds up the calculation and reduces memory usage from quadratic to linear in the sequence length.
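
To illustrate the core idea, below is a minimal sketch (not the fused CUDA kernel) of computing attention block by block with the online-softmax rescaling trick, so that only one tile of the score matrix is ever materialized. The function name `tiled_attention` and the block size are illustrative assumptions.

```python
import torch

def tiled_attention(q, k, v, block_size=128):
    """Illustrative single-head attention computed tile by tile over keys/values,
    using online-softmax rescaling; a reference sketch, not the fused GPU kernel."""
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)                    # running (unnormalized) output
    row_max = torch.full((n, 1), float("-inf"))  # running row-wise max of scores
    row_sum = torch.zeros(n, 1)                  # running softmax denominator
    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale           # scores for this tile only
        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)
        # rescale previous accumulators to the new max, then add this tile's terms
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max
    return out / row_sum

# sanity check against the naive quadratic-memory formulation
q, k, v = (torch.randn(512, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-5)
```

The fused GPU kernel applies the same rescaling inside fast on-chip SRAM and recomputes the tiles during the backward pass instead of storing the full attention matrix, which is where the memory saving comes from.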

The introduction of FlashAttention enables large open source models such as Meta's LLaMA and the UAE's Falcon to accelerate computation and save GPU memory. Its successor, FlashAttention-2, improves on the original with better parallelism and work partitioning; it was proposed by Tri Dao in July 2023 in the paper "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning".
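
In practice, most users do not call these kernels directly: PyTorch 2.x exposes fused attention backends, including FlashAttention on supported GPUs, through torch.nn.functional.scaled_dot_product_attention. A usage sketch with illustrative tensor shapes, assuming a CUDA device is available:

```python
import torch
import torch.nn.functional as F

# Shapes follow the (batch, heads, seq_len, head_dim) convention the API expects.
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# PyTorch selects the fastest available backend; on recent GPUs with fp16/bf16
# inputs this typically dispatches to a FlashAttention-style fused kernel.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```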

FlashAttention-3 was jointly proposed in July 2024 by a research team from Colfax Research, Meta, NVIDIA, Georgia Tech, Princeton University, and Together AI, in the paper "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision". As the latest version in the series, it achieves significant performance gains on the H100 GPU: it is 1.5-2.0 times faster than FlashAttention-2, reaching up to 740 TFLOPS (about 75% of the H100's theoretical peak FLOPS) and close to 1.2 PFLOPS when using FP8. These improvements make LLM training and inference considerably faster and allow the use of lower-precision numbers (FP8) while maintaining accuracy, potentially reducing memory usage and cost.