The recently proposed Forgetting Transformer (FoX) incorporates a forget gate into softmax attention and has shown consistently on-par or better performance compared with the standard RoPE-based Transformer.
Adaptive Computation Pruning (ACP), a method that dynamically prunes computations involving input-output dependencies that are strongly decayed by the forget gate, is introduced for FoX.
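As a concrete illustration, the sketch below shows one way the pruning criterion could be realized at the tile level, following the FoX formulation in which the attention logits receive an additive decay bias D[i, j] equal to the sum of log forget-gate values between key position j and query position i. The function name, block sizes, and threshold value are illustrative assumptions rather than the paper's actual implementation, which operates inside a fused FlashAttention-style kernel rather than as a standalone PyTorch function.

```python
import torch

def negligible_tile_mask(log_fgate: torch.Tensor,
                         block_q: int,
                         block_k: int,
                         threshold: float = -15.0) -> torch.Tensor:
    """Sketch: mark (query-block, key-block) tiles whose attention
    contribution is negligible due to forget-gate decay.

    log_fgate: (T,) per-position log forget-gate values in (-inf, 0].
    The decay bias between query i and key j (j < i) is
        D[i, j] = log_fgate[j+1] + ... + log_fgate[i] = c[i] - c[j],
    where c is the prefix sum. If even the largest bias inside a tile
    falls below `threshold`, every entry in the tile carries a
    multiplicative decay factor of at most exp(threshold), so the
    tile's QK^T and PV work can be skipped.
    """
    T = log_fgate.shape[0]
    c = torch.cumsum(log_fgate, dim=0)               # c[i] = sum_{l<=i} log f_l (non-increasing)
    n_q = (T + block_q - 1) // block_q
    n_k = (T + block_k - 1) // block_k
    skip = torch.zeros(n_q, n_k, dtype=torch.bool)
    for qi in range(n_q):
        q_first = qi * block_q                       # earliest query row in this tile
        for kj in range(n_k):
            k_last = min((kj + 1) * block_k, T) - 1  # latest key column in this tile
            if k_last >= q_first:
                continue                             # tile touches or crosses the causal diagonal; keep it
            # Largest possible bias in the tile: earliest query paired with latest key.
            max_bias = c[q_first] - c[k_last]
            skip[qi, kj] = max_bias < threshold
    return skip
```

In a fused attention kernel, tiles flagged this way would simply never be loaded or multiplied, which is the source of the FLOP and throughput savings reported below.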
ACP reduces the number of FLOPs in softmax attention by around 70% across different model sizes and context lengths, resulting in a 10-35% improvement in training throughput.
The computational savings grow with context length, and ACP does not degrade the performance of FoX.