Transformers struggle with length generalisation, performing poorly even on basic tasks when evaluated on sequences longer than those seen during training.
Two key failure modes of the self-attention mechanism in Transformers are identified: an inability to fully remove irrelevant information from the attention output, and an unintended up-weighting of irrelevant information caused by learned positional biases.
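To make the first failure mode concrete, the snippet below is a minimal illustration (not the paper's analysis): because softmax assigns strictly positive weight to every position, a clearly irrelevant token still contributes to the attention output and can never be fully removed.

```python
# Minimal illustration of the first failure mode, assuming standard
# softmax attention: the irrelevant third position still receives a
# small but strictly positive weight, so its value vector leaks into
# the attention output.
import numpy as np

scores = np.array([8.0, 7.5, -4.0])      # third key is irrelevant
weights = np.exp(scores - scores.max())
weights /= weights.sum()
print(weights)                           # ~[0.62, 0.38, 3.8e-06] -- never exactly zero
```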
Two mitigations are proposed to improve the length-generalisation capabilities of decoder-only Transformers: selective sparsity and contextualised relative distance.
Refactoring the attention mechanism to incorporate both mitigations can substantially improve the length-generalisation performance of Transformers.
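The sketch below is a hedged, single-head illustration of how such a refactored attention step might look; it is not the paper's implementation. It assumes "selective sparsity" can be read as zeroing attention weights that fall below a relative threshold so irrelevant tokens are dropped entirely, and "contextualised relative distance" as a relative-position bias gated by the query content rather than applied as a fixed learned offset. The function names, the gating scheme, and the threshold value are illustrative assumptions.

```python
# Hedged sketch of single-head causal attention with the two mitigations.
# Assumptions (not from the source): selective sparsity = threshold-and-
# renormalise the attention weights; contextualised relative distance =
# a per-distance bias scaled by a query-dependent gate.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_mitigations(Q, K, V, rel_bias, gate_w, threshold=0.05):
    """Q, K, V: (T, d) per-head matrices; rel_bias: (T,) bias per relative
    distance; gate_w: (d,) query projection for the bias gate; threshold:
    fraction of the row maximum below which weights are zeroed."""
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                                  # content scores

    # Contextualised relative distance: each query decides (via a sigmoid
    # gate) how strongly the distance bias should apply, instead of a
    # fixed learned positional bias that can up-weight irrelevant tokens.
    dist = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])   # (T, T)
    gate = 1.0 / (1.0 + np.exp(-(Q @ gate_w)))                     # (T,) in (0, 1)
    scores = scores + gate[:, None] * rel_bias[dist]

    # Causal mask for a decoder-only model.
    scores = np.where(np.triu(np.ones((T, T), dtype=bool), k=1), -np.inf, scores)
    weights = softmax(scores, axis=-1)

    # Selective sparsity: softmax alone never reaches exactly zero, so
    # weights far below the row maximum are clipped to zero and the rest
    # renormalised, letting the head fully remove irrelevant tokens.
    weights = np.where(weights < threshold * weights.max(axis=-1, keepdims=True), 0.0, weights)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example usage with random inputs.
rng = np.random.default_rng(0)
T, d = 6, 8
out = attention_with_mitigations(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
                                 rng.normal(size=(T, d)), rng.normal(size=(T,)),
                                 rng.normal(size=(d,)))
print(out.shape)  # (6, 8)
```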