Neural language models acquire a language's structure by training on next-token prediction, and the study derives theoretical scaling laws that describe how this acquisition improves with the amount of training data.
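As a minimal sketch of what extracting such a scaling law looks like in practice (not the paper's code, and with placeholder measurements), a power-law relation between test loss and training-set size P can be estimated by a linear fit in log-log space:

```python
# Sketch only: fit test loss L(P) ~ A * P**(-alpha) to hypothetical measurements.
import numpy as np

# Placeholder (training-set size, test cross-entropy) pairs.
P = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
loss = np.array([2.1, 1.6, 1.2, 0.9, 0.7])

# A power law is linear in log-log coordinates: log L = log A - alpha * log P.
slope, intercept = np.polyfit(np.log(P), np.log(loss), deg=1)
alpha, A = -slope, np.exp(intercept)
print(f"estimated exponent alpha ~ {alpha:.2f}, prefactor A ~ {A:.2f}")
```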
The study focuses on synthetic datasets generated by the Random Hierarchy Model (RHM), a generative process designed to capture the hierarchical structure of natural language.
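The following is a minimal sketch of an RHM-like generative process, assuming the usual parameterisation: v symbols per level, m production rules per symbol, each rule expanding a symbol into s children, over L levels. The actual RHM additionally constrains the rules to be unambiguous; that constraint is omitted here for brevity.

```python
# Sketch of RHM-style hierarchical data generation (constraints simplified).
import random

def make_rules(v, m, s, L, seed=0):
    """For each level and each symbol, draw m random productions of length s."""
    rng = random.Random(seed)
    return [
        {sym: [tuple(rng.randrange(v) for _ in range(s)) for _ in range(m)]
         for sym in range(v)}
        for _ in range(L)
    ]

def sample_sentence(rules, v, L, seed=None):
    """Expand a random root symbol down the hierarchy into a leaf token sequence."""
    rng = random.Random(seed)
    symbols = [rng.randrange(v)]
    for level in range(L):
        expanded = []
        for sym in symbols:
            expanded.extend(rng.choice(rules[level][sym]))  # pick one of m rules
        symbols = expanded
    return symbols  # sequence of length s**L

rules = make_rules(v=8, m=2, s=2, L=3)
print(sample_sentence(rules, v=8, L=3))  # e.g. 8 leaf tokens
```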
Convolutional networks exhibit faster scaling of performance than transformer models because their locality and weight sharing align with the data's generative process.
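To make those two inductive biases concrete, here is a minimal PyTorch sketch of a causal convolutional next-token predictor: each output position sees only a small local window (locality), and the same filter is reused at every position (weight sharing). Hyperparameters are arbitrary, and this is an illustration rather than the architecture studied in the paper.

```python
# Sketch of a causal convolutional next-token predictor (illustrative only).
import torch
import torch.nn as nn

class CausalConvLM(nn.Module):
    def __init__(self, vocab_size=8, dim=64, kernel_size=2, layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Stacked convolutions share one filter per layer across all positions.
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size) for _ in range(layers)
        )
        self.pad = kernel_size - 1  # left padding keeps the model causal
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                    # tokens: (batch, length)
        x = self.embed(tokens).transpose(1, 2)    # (batch, dim, length)
        for conv in self.convs:
            x = torch.relu(conv(nn.functional.pad(x, (self.pad, 0))))
        return self.head(x.transpose(1, 2))       # (batch, length, vocab)

model = CausalConvLM()
logits = model(torch.randint(0, 8, (4, 8)))       # next-token logits per position
```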
The interaction between model architecture and data properties shapes representation learning in neural models.