Large language models (LLMs) perform poorly on low-resource languages, including many African languages.
Combining curated African-language data with high-quality English educational texts significantly improves model performance on these languages.
On the IrokoBench dataset, the models consistently outperform other baselines, particularly on knowledge-intensive multiple-choice questions (AfriMMLU).
The models also outperform the base model by over 10% on AfriQA, a cross-lingual question-answering benchmark.