Grokking, a phenomenon in which neural networks suddenly generalize long after reaching perfect training accuracy, has been linked to the embedding layers of Transformers and MLPs.
Adding an embedding layer to an otherwise standard MLP induces delayed generalization on modular arithmetic tasks, highlighting the central role embeddings play in grokking; a minimal version of this setup is sketched below.
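The sketch pairs a trainable embedding table with a one-hidden-layer MLP on the modular addition task (a + b) mod p. The modulus, layer widths, and use of PyTorch are illustrative assumptions, not details taken from the source.

```python
import torch
import torch.nn as nn

p, d_embed, d_hidden = 97, 128, 256  # modulus and widths chosen for illustration only

class EmbedMLP(nn.Module):
    """One-hidden-layer MLP over concatenated token embeddings for (a + b) mod p."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(p, d_embed)       # trainable embedding table
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_embed, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, p),                 # logits over the p residues
        )

    def forward(self, a, b):
        x = torch.cat([self.embed(a), self.embed(b)], dim=-1)
        return self.mlp(x)

# Enumerate all (a, b) pairs; grokking studies typically train on a random fraction of them.
a, b = torch.meshgrid(torch.arange(p), torch.arange(p), indexing="ij")
a, b = a.reshape(-1), b.reshape(-1)
y = (a + b) % p
loss = nn.functional.cross_entropy(EmbedMLP()(a, b), y)
```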
The analysis identifies two key mechanisms driving grokking: embedding update dynamics and bilinear coupling between embeddings and downstream weights.
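One way to see the coupling is through the simplest bilinear readout, where the logits are a product of a downstream weight matrix and an embedding vector; this parameterization is an illustrative assumption rather than the exact model analyzed:

\[
\ell = W E_a, \qquad
\Delta E_a \propto -\,W^{\top}\nabla_{\ell}\mathcal{L}, \qquad
\Delta W \propto -\,(\nabla_{\ell}\mathcal{L})\, E_a^{\top},
\]

so each factor's gradient step is scaled by the current magnitude of the other: while both remain small the updates stall, and once either grows the two co-adapt quickly.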
Mitigations such as frequency-aware sampling and embedding-specific learning rates are proposed to counteract the bilinear coupling and accelerate the transition from memorization to generalization.
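Continuing the sketch above (it reuses EmbedMLP, a, b, y, and p defined there), the snippet below shows one plausible way to wire up both mitigations in PyTorch: a separate optimizer parameter group gives the embedding table its own learning rate, and a weighted sampler upweights under-represented targets. The specific rates and the inverse-frequency weighting rule are assumptions for illustration, not prescriptions from the source.

```python
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

model = EmbedMLP()

# Embedding-specific learning rate: the embedding table gets its own parameter group.
optimizer = torch.optim.AdamW([
    {"params": model.embed.parameters(), "lr": 1e-2},  # faster embedding updates (illustrative)
    {"params": model.mlp.parameters(), "lr": 1e-3},
])

# Frequency-aware sampling: upweight examples whose target residue is rare in the split.
dataset = TensorDataset(a, b, y)
counts = torch.bincount(y, minlength=p).float()
weights = 1.0 / counts[y]                              # inverse-frequency weights (illustrative)
sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
loader = DataLoader(dataset, batch_size=512, sampler=sampler)
```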