Transformers struggle with length generalisation, performing poorly even on basic tasks when evaluated on sequences longer than those seen during training.
Two key failure modes of the self-attention mechanism in Transformers are identified: an inability to fully remove irrelevant information from the attention output, and an unintended up-weighting of irrelevant information caused by learned positional biases.
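To make the first failure mode concrete, the snippet below is a minimal illustration (not the paper's analysis): because softmax assigns strictly positive weight to every position, a clearly irrelevant token still contributes to the attention output and can never be fully removed.

```python
# Minimal illustration of the first failure mode, assuming standard
# softmax attention: the irrelevant third position still receives a
# small but strictly positive weight, so its value vector leaks into
# the attention output.
import numpy as np

scores = np.array([8.0, 7.5, -4.0])      # third key is irrelevant
weights = np.exp(scores - scores.max())
weights /= weights.sum()
print(weights)                           # ~[0.62, 0.38, 3.8e-06] -- never exactly zero
```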
Two mitigations are proposed to improve the length-generalisation capabilities of decoder-only Transformers: selective sparsity and contextualised relative distance.
Refactoring the attention mechanism to incorporate both mitigations can substantially improve the length-generalisation performance of Transformers.
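The sketch below is a hedged, single-head illustration of how such a refactored attention step might look; it is not the paper's implementation. It assumes "selective sparsity" can be read as zeroing attention weights that fall below a relative threshold so irrelevant tokens are dropped entirely, and "contextualised relative distance" as a relative-position bias gated by the query content rather than applied as a fixed learned offset. The function names, the gating scheme, and the threshold value are illustrative assumptions.

```python
# Hedged sketch of single-head causal attention with the two mitigations.
# Assumptions (not from the source): selective sparsity = threshold-and-
# renormalise the attention weights; contextualised relative distance =
# a per-distance bias scaled by a query-dependent gate.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_mitigations(Q, K, V, rel_bias, gate_w, threshold=0.05):
    """Q, K, V: (T, d) per-head matrices; rel_bias: (T,) bias per relative
    distance; gate_w: (d,) query projection for the bias gate; threshold:
    fraction of the row maximum below which weights are zeroed."""
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                                  # content scores

    # Contextualised relative distance: each query decides (via a sigmoid
    # gate) how strongly the distance bias should apply, instead of a
    # fixed learned positional bias that can up-weight irrelevant tokens.
    dist = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])   # (T, T)
    gate = 1.0 / (1.0 + np.exp(-(Q @ gate_w)))                     # (T,) in (0, 1)
    scores = scores + gate[:, None] * rel_bias[dist]

    # Causal mask for a decoder-only model.
    scores = np.where(np.triu(np.ones((T, T), dtype=bool), k=1), -np.inf, scores)
    weights = softmax(scores, axis=-1)

    # Selective sparsity: softmax alone never reaches exactly zero, so
    # weights far below the row maximum are clipped to zero and the rest
    # renormalised, letting the head fully remove irrelevant tokens.
    weights = np.where(weights < threshold * weights.max(axis=-1, keepdims=True), 0.0, weights)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example usage with random inputs.
rng = np.random.default_rng(0)
T, d = 6, 8
out = attention_with_mitigations(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
                                 rng.normal(size=(T, d)), rng.normal(size=(T,)),
                                 rng.normal(size=(d,)))
print(out.shape)  # (6, 8)
```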