Grokking, a phenomenon in which neural networks suddenly generalize long after reaching perfect training accuracy, has been linked to the embedding layers of Transformers and MLPs.
Adding an embedding layer to an otherwise standard MLP induces delayed generalization on modular arithmetic tasks, highlighting the central role embeddings play in grokking; a minimal version of this setup is sketched below.
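The sketch pairs a trainable embedding table with a one-hidden-layer MLP on the modular addition task (a + b) mod p. The modulus, layer widths, and use of PyTorch are illustrative assumptions, not details taken from the source.

```python
import torch
import torch.nn as nn

p, d_embed, d_hidden = 97, 128, 256  # modulus and widths chosen for illustration only

class EmbedMLP(nn.Module):
    """One-hidden-layer MLP over concatenated token embeddings for (a + b) mod p."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(p, d_embed)       # trainable embedding table
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_embed, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, p),                 # logits over the p residues
        )

    def forward(self, a, b):
        x = torch.cat([self.embed(a), self.embed(b)], dim=-1)
        return self.mlp(x)

# Enumerate all (a, b) pairs; grokking studies typically train on a random fraction of them.
a, b = torch.meshgrid(torch.arange(p), torch.arange(p), indexing="ij")
a, b = a.reshape(-1), b.reshape(-1)
y = (a + b) % p
loss = nn.functional.cross_entropy(EmbedMLP()(a, b), y)
```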
The analysis identifies two key mechanisms driving grokking: embedding update dynamics and bilinear coupling between embeddings and downstream weights.
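One way to see the coupling is through the simplest bilinear readout, where the logits are a product of a downstream weight matrix and an embedding vector; this parameterization is an illustrative assumption rather than the exact model analyzed:

\[
\ell = W E_a, \qquad
\Delta E_a \propto -\,W^{\top}\nabla_{\ell}\mathcal{L}, \qquad
\Delta W \propto -\,(\nabla_{\ell}\mathcal{L})\, E_a^{\top},
\]

so each factor's gradient step is scaled by the current magnitude of the other: while both remain small the updates stall, and once either grows the two co-adapt quickly.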
Mitigations such as frequency-aware sampling and embedding-specific learning rates are proposed to counteract the bilinear coupling and accelerate the transition from memorization to generalization.
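Continuing the sketch above (it reuses EmbedMLP, a, b, y, and p defined there), the snippet below shows one plausible way to wire up both mitigations in PyTorch: a separate optimizer parameter group gives the embedding table its own learning rate, and a weighted sampler upweights under-represented targets. The specific rates and the inverse-frequency weighting rule are assumptions for illustration, not prescriptions from the source.

```python
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

model = EmbedMLP()

# Embedding-specific learning rate: the embedding table gets its own parameter group.
optimizer = torch.optim.AdamW([
    {"params": model.embed.parameters(), "lr": 1e-2},  # faster embedding updates (illustrative)
    {"params": model.mlp.parameters(), "lr": 1e-3},
])

# Frequency-aware sampling: upweight examples whose target residue is rare in the split.
dataset = TensorDataset(a, b, y)
counts = torch.bincount(y, minlength=p).float()
weights = 1.0 / counts[y]                              # inverse-frequency weights (illustrative)
sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
loader = DataLoader(dataset, batch_size=512, sampler=sampler)
```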