Large language models (LLMs) suffer performance drops in all but a few high-resource languages because of the imbalance in their pre-training data.
Inspired by human second-language acquisition, code-switching curriculum learning (CSCL) is proposed to enhance cross-lingual transfer in LLMs.
CSCL mimics the stages of human language learning by progressing from token-level code-switching, to sentence-level code-switching, and finally to training on monolingual corpora in the target language.
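To make the staged curriculum concrete, the following is a minimal sketch of how such training data could be assembled; it is not the paper's implementation. The bilingual lexicon, parallel pairs, and helper names (token_level_switch, sentence_level_switch, build_curriculum) are illustrative assumptions.

```python
import random

# Illustrative resources (assumptions, not artifacts of the paper): a small
# English-Korean lexicon, one parallel sentence pair, and one monolingual line.
LEXICON = {"language": "언어", "model": "모델", "learns": "학습한다"}
PARALLEL = [("The model learns a new language.", "모델은 새로운 언어를 학습한다.")]
TARGET_MONO = ["한국어 단일 언어 말뭉치에서 가져온 문장입니다."]


def token_level_switch(sentence: str, lexicon: dict, p: float = 0.3) -> str:
    """Stage 1: replace a fraction of source-language tokens with
    target-language equivalents drawn from a bilingual lexicon."""
    out = []
    for tok in sentence.split():
        key = tok.strip(".,").lower()
        out.append(lexicon[key] if key in lexicon and random.random() < p else tok)
    return " ".join(out)


def sentence_level_switch(pairs):
    """Stage 2: code-switch at sentence boundaries, here by concatenating
    each source sentence with its target-language translation."""
    return [f"{src} {tgt}" for src, tgt in pairs]


def build_curriculum(pairs, mono, lexicon):
    """Order the data from token-level mixing, to sentence-level mixing,
    to purely monolingual target-language text."""
    stage1 = [token_level_switch(src, lexicon) for src, _ in pairs]
    stage2 = sentence_level_switch(pairs)
    stage3 = list(mono)
    return stage1 + stage2 + stage3


if __name__ == "__main__":
    for line in build_curriculum(PARALLEL, TARGET_MONO, LEXICON):
        print(line)
```

In this sketch the three stages are simply concatenated in order; an actual training run would schedule them as successive phases of continual pre-training.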
Using the Qwen 2 model, CSCL yields significant gains in language transfer to Korean compared with monolingual continual pre-training.
Ablation studies confirm that both token- and sentence-level code-switching enhance cross-lingual transfer, and that curriculum ordering amplifies these gains.
The study extends to Japanese and Indonesian using the Gemma 2 and Phi 3.5 models, again demonstrating improved language transfer.
CSCL also mitigates spurious correlations between a language's resource level and safety alignment, offering an efficient framework for more equitable language transfer in LLMs.
CSCL remains effective in low-resource settings that lack high-quality monolingual corpora for language transfer.