The Transformer architecture underpins the success of Large Language Models on a wide range of algorithmic tasks when trained with gradient-based methods for next-token prediction.
Comparing standard Transformers with variants in which the MLP layers or the attention projectors are frozen shows that trainable attention accounts for much of the performance gain.
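To make this comparison concrete, below is a minimal PyTorch sketch of a pre-norm Transformer block in which either the MLP or the attention projections can be frozen, so only the remaining component is updated during training. The module structure and the `freeze` argument are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Pre-norm Transformer block where one component can be kept frozen."""

    def __init__(self, d_model: int, n_heads: int, freeze: str = "none"):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        # Keep the chosen component at its random initialization:
        # its parameters receive no gradient updates during training.
        frozen = {"mlp": self.mlp, "attn": self.attn}.get(freeze)
        if frozen is not None:
            for p in frozen.parameters():
                p.requires_grad_(False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask so each position attends only to earlier tokens,
        # matching a next-token-prediction setup.
        T = x.size(1)
        mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x
```

For example, `Block(128, 4, freeze="attn")` keeps the attention projectors at their random values while the MLP still trains, which is the kind of ablation the comparison above relies on.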
The MixiT model, which uses fixed random attention coefficients, matches fully trained Transformers on arithmetic and memorization tasks, but underperforms on retrieval-based tasks because it cannot form specialized circuits such as induction heads.
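As a rough illustration of the fixed-random-attention idea, the sketch below replaces learned query-key attention with a causal, row-stochastic mixing matrix drawn once at initialization and kept frozen, while the value and output projections remain trainable. The class name `FixedRandomAttention` and the normalization choices are assumptions for illustration, not the authors' exact construction.

```python
import torch
import torch.nn as nn


class FixedRandomAttention(nn.Module):
    """Token mixing with attention coefficients fixed at random initialization."""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        scores = torch.randn(max_len, max_len)
        causal = torch.tril(torch.ones(max_len, max_len, dtype=torch.bool))
        scores = scores.masked_fill(~causal, float("-inf"))
        # Frozen, row-stochastic mixing coefficients; registered as a buffer,
        # so they are never updated by the optimizer.
        self.register_buffer("coeff", torch.softmax(scores, dim=-1))
        self.v_proj = nn.Linear(d_model, d_model)    # trained
        self.out_proj = nn.Linear(d_model, d_model)  # trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), with seq_len <= max_len.
        T = x.size(1)
        v = self.v_proj(x)
        # Mix token values with the fixed random coefficients; the mixing
        # pattern cannot adapt to the input, unlike learned attention.
        mixed = self.coeff[:T, :T] @ v
        return self.out_proj(mixed)
```

Because the mixing pattern is input-independent, such a layer cannot implement content-based lookups of the kind induction heads perform, which is consistent with the retrieval gap described above.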
These results highlight the importance of architectural heterogeneity: distinct components supply the inductive biases needed to solve diverse tasks.