Language models often struggle with cross-mode knowledge retrieval, i.e., accessing knowledge learned in one format when queried in another.
Models trained on multiple data sources show reduced accuracy when knowledge is retrieved in a mode different from the one in which it was learned.
A controlled study of random-token-sequence memorization across modes quantifies this limitation.
CASCADE, a novel pretraining algorithm that trains on cascading datasets of varying sequence lengths, outperforms dataset-rewriting approaches and improves language models' cross-mode knowledge retrieval.
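The "cascading datasets" idea can be illustrated as re-chunking one token stream at several sequence lengths, so the same content is seen in windows of different granularity. This is a minimal sketch only: the function name, the choice of lengths, and the non-overlapping-window chunking scheme are assumptions for illustration, not the paper's exact construction.

```python
def cascade_datasets(tokens, lengths=(64, 256, 1024)):
    """Build one sub-dataset per sequence length by re-chunking the
    same token stream (hypothetical sketch; CASCADE's actual chunking
    scheme may differ)."""
    datasets = {}
    for L in lengths:
        # Non-overlapping windows of length L; the ragged tail is dropped.
        datasets[L] = [tokens[i:i + L] for i in range(0, len(tokens) - L + 1, L)]
    return datasets

# Stand-in for a tokenized corpus of 5000 token ids.
corpus = list(range(5000))
cascade = cascade_datasets(corpus)
for L, chunks in cascade.items():
    print(L, len(chunks))
```

Training on all scales jointly would expose the model to both short, local contexts and long ones, which is one plausible way varying sequence lengths could aid retrieval across modes.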