
Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

Sivtsov, Danil ; Rodkin, Ivan ; Kuzmin, Gleb ; Kuratov, Yuri ; Oseledets, Ivan
Published: 6/8/2025
Abstract

Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, their memory update mechanism leads to sequential execution, causing a performance bottleneck. We introduce Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence. This approach eliminates the sequential constraint, enabling efficient GPU inference even for single long-context inputs without complex batching and pipelining techniques. Because the technique is purely a run-time computation reordering, existing RMT models can adopt it with no retraining. Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x speedup over the sequential RMT implementation on 131,072-token sequences. By removing the sequential bottleneck, Diagonal Batching reduces inference cost and latency, thereby strengthening RMTs as a practical solution for real-world, long-context applications.
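
The scheme can be pictured as a wavefront schedule over the (layer, segment) grid: cell (l, s) depends only on the output of layer l-1 on the same segment and on the memory written by layer l on the previous segment, so all cells with the same anti-diagonal index l + s are mutually independent and can be launched together. Below is a minimal sketch of that scheduling loop under these assumptions; the names layer_step, hidden, and memory are illustrative placeholders, not the authors' implementation.

def diagonal_batching(layer_step, inputs, init_memory, num_layers):
    """Run `num_layers` recurrent-memory layers over the list of segment
    inputs, grouping independent (layer, segment) cells per anti-diagonal."""
    num_segments = len(inputs)
    # hidden[(l, s)]: activations entering layer l for segment s
    hidden = {(0, s): inputs[s] for s in range(num_segments)}
    # memory[(l, s)]: memory state produced by layer l after segment s
    memory = {(l, -1): init_memory for l in range(num_layers)}

    # Sweep anti-diagonals d = l + s from 0 to (num_layers + num_segments - 2).
    for d in range(num_layers + num_segments - 1):
        cells = [(l, d - l) for l in range(num_layers)
                 if 0 <= d - l < num_segments]
        # Every cell on this diagonal only reads results from diagonal d - 1,
        # so a real implementation would stack them into one batched GPU call.
        for l, s in cells:
            h_out, m_out = layer_step(l, hidden[(l, s)], memory[(l, s - 1)])
            hidden[(l + 1, s)] = h_out   # feeds layer l + 1, same segment
            memory[(l, s)] = m_out       # feeds layer l, next segment

    return [hidden[(num_layers, s)] for s in range(num_segments)]

In practice the per-cell loop over `cells` would be replaced by a single stacked kernel launch per diagonal, which is what turns L x S strictly sequential steps into roughly L + S batched ones while preserving the exact recurrence.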