The paper introduces Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly.
TPA significantly reduces memory overhead during inference because only the compact low-rank factors of the keys and values need to be stored, shrinking the size of the key-value (KV) cache.
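As a rough illustration of this factorized-caching idea, the sketch below is a minimal, hypothetical PyTorch rendering: the module name, projection layout, and rank are assumptions for illustration, not the paper's exact parameterization. Each token is projected into small per-head and per-dimension factors, and keys/values are reconstructed as a sum of rank-1 outer products, so only the factors would need to be cached.

```python
import torch
import torch.nn as nn

class TPAKeyValue(nn.Module):
    """Sketch of tensor-product-style factorized key/value projections (assumed shapes)."""

    def __init__(self, d_model=512, n_heads=8, d_head=64, rank=2):
        super().__init__()
        self.n_heads, self.d_head, self.rank = n_heads, d_head, rank
        # Per-token factors: one factor over heads, one over the head dimension.
        self.to_a_k = nn.Linear(d_model, rank * n_heads)
        self.to_b_k = nn.Linear(d_model, rank * d_head)
        self.to_a_v = nn.Linear(d_model, rank * n_heads)
        self.to_b_v = nn.Linear(d_model, rank * d_head)

    def factors(self, x):
        # x: (batch, seq, d_model) -> factors that would be cached instead of full K/V.
        B, T, _ = x.shape
        a_k = self.to_a_k(x).view(B, T, self.rank, self.n_heads)
        b_k = self.to_b_k(x).view(B, T, self.rank, self.d_head)
        a_v = self.to_a_v(x).view(B, T, self.rank, self.n_heads)
        b_v = self.to_b_v(x).view(B, T, self.rank, self.d_head)
        return a_k, b_k, a_v, b_v

    def reconstruct(self, a, b):
        # Sum of rank-1 outer products -> full (batch, seq, n_heads, d_head) tensor.
        return torch.einsum('btrh,btrd->bthd', a, b) / self.rank
```

Under these assumed hyperparameters (8 heads of dimension 64, rank 2), the cached key factors for one token take 2 × (8 + 64) = 144 numbers instead of the 8 × 64 = 512 needed for the full key, which is where the KV-cache savings come from.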
Building on TPA, the paper introduces the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling.
T6 outperforms standard Transformer baselines on language modeling tasks, improving model quality while reducing inference memory.