The paper introduces Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly.
TPA significantly reduces memory overhead during inference because only the compact low-rank factors of the keys and values need to be stored, shrinking the size of the key-value (KV) cache.
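As a rough illustration of this factorized-caching idea, the sketch below is a minimal, hypothetical PyTorch rendering: the module name, projection layout, and rank are assumptions for illustration, not the paper's exact parameterization. Each token is projected into small per-head and per-dimension factors, and keys/values are reconstructed as a sum of rank-1 outer products, so only the factors would need to be cached.

```python
import torch
import torch.nn as nn

class TPAKeyValue(nn.Module):
    """Sketch of tensor-product-style factorized key/value projections (assumed shapes)."""

    def __init__(self, d_model=512, n_heads=8, d_head=64, rank=2):
        super().__init__()
        self.n_heads, self.d_head, self.rank = n_heads, d_head, rank
        # Per-token factors: one factor over heads, one over the head dimension.
        self.to_a_k = nn.Linear(d_model, rank * n_heads)
        self.to_b_k = nn.Linear(d_model, rank * d_head)
        self.to_a_v = nn.Linear(d_model, rank * n_heads)
        self.to_b_v = nn.Linear(d_model, rank * d_head)

    def factors(self, x):
        # x: (batch, seq, d_model) -> factors that would be cached instead of full K/V.
        B, T, _ = x.shape
        a_k = self.to_a_k(x).view(B, T, self.rank, self.n_heads)
        b_k = self.to_b_k(x).view(B, T, self.rank, self.d_head)
        a_v = self.to_a_v(x).view(B, T, self.rank, self.n_heads)
        b_v = self.to_b_v(x).view(B, T, self.rank, self.d_head)
        return a_k, b_k, a_v, b_v

    def reconstruct(self, a, b):
        # Sum of rank-1 outer products -> full (batch, seq, n_heads, d_head) tensor.
        return torch.einsum('btrh,btrd->bthd', a, b) / self.rank
```

Under these assumed hyperparameters (8 heads of dimension 64, rank 2), the cached key factors for one token take 2 × (8 + 64) = 144 numbers instead of the 8 × 64 = 512 needed for the full key, which is where the KV-cache savings come from.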
Building on TPA, the paper introduces the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling.
T6 outperforms standard Transformer baselines on language modeling tasks, improving model quality while reducing inference memory.