This study investigates the relationship between memorization and generalization in large language models (LLMs).
Pre-training capacity-limited Transformer models from scratch on synthetic character-level tasks reveals a trade-off between the two behaviors.
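As a concrete illustration of this kind of setup (a minimal sketch under assumed details, not the study's actual data pipeline), the two synthetic character-level tasks can be realized as an addition task whose training split is restricted to small operands, with larger operands held out to probe extrapolation, and a memorization task of arbitrary key-value string pairs. The function names, operand ranges, and alphabet below are illustrative assumptions.

```python
# Illustrative sketch of two synthetic character-level tasks (assumed setup,
# not the paper's released code). Operand ranges and alphabet are arbitrary choices.
import random
import string

def make_arithmetic_split(train_max=99, test_min=100, test_max=999, n_test=500):
    """Character-level addition: train on small operands and hold out larger
    operands so the test set probes extrapolation rather than recall."""
    train = [f"{a}+{b}={a + b}"
             for a in range(train_max + 1)
             for b in range(train_max + 1)]
    test = []
    for _ in range(n_test):
        a = random.randint(test_min, test_max)
        b = random.randint(test_min, test_max)
        test.append(f"{a}+{b}={a + b}")
    return train, test

def make_memorization_pairs(n_pairs=5000, key_len=8, val_len=8):
    """Memorization task: arbitrary key->value string pairs with no shared
    structure, so correct answers require storing each pair individually."""
    rand_str = lambda n: "".join(random.choices(string.ascii_lowercase, k=n))
    return [f"{rand_str(key_len)}:{rand_str(val_len)}" for _ in range(n_pairs)]

if __name__ == "__main__":
    arith_train, arith_test = make_arithmetic_split()
    memo = make_memorization_pairs()
    print(len(arith_train), "arithmetic training strings, e.g.", arith_train[0])
    print(len(arith_test), "held-out extrapolation strings, e.g.", arith_test[0])
    print(len(memo), "memorization strings, e.g.", memo[0])
```

A character-level Transformer of a chosen capacity would then be pre-trained from scratch on these strings, with performance on the held-out operand range measuring generalization and recall of the random pairs measuring memorization.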
Small models excel at extrapolating to unseen arithmetic cases but fail at memorization, whereas larger models memorize effectively but struggle to extrapolate.
An intermediate-capacity model likewise shifts toward memorization at the expense of generalization.
When trained on both tasks together, no model size succeeds at extrapolation.
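A minimal (assumed) sketch of this joint-training condition is simply to shuffle examples from both tasks into one character-level pre-training stream; `make_joint_stream` and its arguments are hypothetical and build on the data-generation sketch above.

```python
# Assumed sketch of the joint-training condition: examples from the arithmetic
# and memorization tasks are shuffled into a single character-level corpus.
import random

def make_joint_stream(arith_train, memo_pairs, seed=0):
    """Interleave both tasks into one pre-training text stream, mirroring the
    'trained on both tasks together' condition."""
    stream = list(arith_train) + list(memo_pairs)
    random.Random(seed).shuffle(stream)   # mix tasks so batches contain both
    return "\n".join(stream)              # newline-delimited character-level corpus
```

Pre-training the same capacity-limited models on this combined corpus then tests whether extrapolation survives when memorization pressure is present.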
The study indicates that pre-training may inherently prioritize one learning mode over the other.
By examining these dynamics in a controlled setting, the study offers insight into how model capacity shapes learning behavior, with implications for the design and deployment of small language models.