The recently proposed Forgetting Transformer (FoX) incorporates a forget gate into softmax attention and has shown consistently on-par or better performance compared with the standard RoPE-based Transformer.
Adaptive Computation Pruning (ACP), a method that dynamically prunes computations involving input-output dependencies that are strongly decayed by the forget gate, is introduced for FoX.
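As a concrete illustration, the sketch below shows one way the pruning criterion could be realized at the tile level, following the FoX formulation in which the attention logits receive an additive decay bias D[i, j] equal to the sum of log forget-gate values between key position j and query position i. The function name, block sizes, and threshold value are illustrative assumptions rather than the paper's actual implementation, which operates inside a fused FlashAttention-style kernel rather than as a standalone PyTorch function.

```python
import torch

def negligible_tile_mask(log_fgate: torch.Tensor,
                         block_q: int,
                         block_k: int,
                         threshold: float = -15.0) -> torch.Tensor:
    """Sketch: mark (query-block, key-block) tiles whose attention
    contribution is negligible due to forget-gate decay.

    log_fgate: (T,) per-position log forget-gate values in (-inf, 0].
    The decay bias between query i and key j (j < i) is
        D[i, j] = log_fgate[j+1] + ... + log_fgate[i] = c[i] - c[j],
    where c is the prefix sum. If even the largest bias inside a tile
    falls below `threshold`, every entry in the tile carries a
    multiplicative decay factor of at most exp(threshold), so the
    tile's QK^T and PV work can be skipped.
    """
    T = log_fgate.shape[0]
    c = torch.cumsum(log_fgate, dim=0)               # c[i] = sum_{l<=i} log f_l (non-increasing)
    n_q = (T + block_q - 1) // block_q
    n_k = (T + block_k - 1) // block_k
    skip = torch.zeros(n_q, n_k, dtype=torch.bool)
    for qi in range(n_q):
        q_first = qi * block_q                       # earliest query row in this tile
        for kj in range(n_k):
            k_last = min((kj + 1) * block_k, T) - 1  # latest key column in this tile
            if k_last >= q_first:
                continue                             # tile touches or crosses the causal diagonal; keep it
            # Largest possible bias in the tile: earliest query paired with latest key.
            max_bias = c[q_first] - c[k_last]
            skip[qi, kj] = max_bias < threshold
    return skip
```

In a fused attention kernel, tiles flagged this way would simply never be loaded or multiplied, which is the source of the FLOP and throughput savings reported below.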
ACP reduces the number of FLOPs in softmax attention by around 70% across different model sizes and context lengths, resulting in a 10-35% improvement in training throughput.
The computational savings grow with context length, and ACP does not degrade the performance of FoX.