Transformer models struggle with long-context inference because of their quadratic time and linear memory complexity.
Recurrent Memory Transformers (RMTs) address this by processing the input in segments, reducing the cost to linear time and constant memory usage, but their segment-level recurrence introduces a sequential execution bottleneck.
Diagonal Batching is introduced as a scheduling scheme that enables parallelism across segments in RMTs, improving GPU inference efficiency without requiring complex batching techniques.
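To make the scheduling idea concrete, below is a minimal sketch of diagonal (wavefront) scheduling over a (layer, segment) grid, assuming per-layer memory recurrence as in ARMT: cell (layer, segment) depends on the activations from (layer - 1, segment) and the memory from (layer, segment - 1), so all cells on the same anti-diagonal are mutually independent and can be grouped into one batched GPU call. The helpers `layer_forward`, `init_memory`, and `embed` are illustrative placeholders, not the authors' API, and the inner Python loop stands in for a fused batched kernel.

```python
# Hypothetical sketch of diagonal (wavefront) scheduling for a segment-recurrent model.
# Cell (layer, segment) depends on (layer - 1, segment) via activations and on
# (layer, segment - 1) via the layer's recurrent memory, so cells with the same
# layer + segment value are independent and can be batched together.

from typing import Callable, List, Tuple


def diagonal_schedule(num_layers: int, num_segments: int) -> List[List[Tuple[int, int]]]:
    """Group (layer, segment) cells by anti-diagonal; each group is independent."""
    diagonals: List[List[Tuple[int, int]]] = []
    for d in range(num_layers + num_segments - 1):
        group = [
            (layer, d - layer)
            for layer in range(num_layers)
            if 0 <= d - layer < num_segments
        ]
        diagonals.append(group)
    return diagonals


def run_wavefront(
    num_layers: int,
    segments: List[object],
    layer_forward: Callable[[int, object, object], Tuple[object, object]],
    init_memory: Callable[[int], object],
    embed: Callable[[object], object],
) -> List[object]:
    """Execute the (layer, segment) grid diagonal by diagonal.

    `layer_forward(layer, hidden, memory) -> (hidden, memory)` is an illustrative
    stand-in for a transformer layer with per-layer recurrent memory. In a real
    implementation, each diagonal's cells would be fused into a single batched
    kernel launch instead of this Python loop.
    """
    num_segments = len(segments)
    hidden = {s: embed(segments[s]) for s in range(num_segments)}  # per-segment layer input
    memory = {l: init_memory(l) for l in range(num_layers)}        # per-layer recurrent state

    for group in diagonal_schedule(num_layers, num_segments):
        # All cells in `group` are independent of one another -> batchable on the GPU.
        for layer, seg in group:
            hidden[seg], memory[layer] = layer_forward(layer, hidden[seg], memory[layer])

    return [hidden[s] for s in range(num_segments)]
```

Because each diagonal contains at most one cell per layer, the per-layer memory is still updated in strict segment order, so the recurrence itself is unchanged; only the execution order across independent cells is rearranged to expose parallelism.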
Applying Diagonal Batching to the ARMT model yields significant inference speedups, strengthening the practicality of RMTs for real-world long-context applications.