Grouped-query Attention (GQA)

Grouped-Query Attention (GQA) is a method that interpolates between Multi-Query Attention (MQA) and Multi-Head Attention (MHA) in Large Language Models (LLMs). Its goal is to achieve the quality of MHA while approaching the speed of MQA.

Key attributes of GQA include:

  • Interpolation: GQA is an intermediate method between MQA and MHA that addresses the shortcomings of MQA, such as quality degradation and training instability.
  • Efficiency: by using an intermediate number of key-value heads, GQA reduces memory and bandwidth costs while largely preserving model quality.
  • Trade-off: GQA strikes a balance between the speed of MQA and the quality of MHA, providing a favorable trade-off.
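The interpolation above can be sketched concretely: each key-value head is shared by a group of query heads, so setting the number of KV heads equal to the number of query heads recovers MHA, and setting it to 1 recovers MQA. The following is a minimal NumPy sketch (function names and shapes are illustrative, not from any particular framework):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v, num_kv_heads):
    """q: (num_heads, seq, dim); k, v: (num_kv_heads, seq, dim).

    num_heads must be divisible by num_kv_heads.
    num_kv_heads == num_heads -> MHA; num_kv_heads == 1 -> MQA.
    """
    num_heads, seq, dim = q.shape
    group = num_heads // num_kv_heads
    # share each KV head across a contiguous group of query heads
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dim)
    return softmax(scores) @ v
```

Because only `num_kv_heads` key-value projections are cached during decoding, the KV cache shrinks by a factor of `num_heads / num_kv_heads` compared with MHA, which is where the speed gain comes from.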