CS-Sum is introduced to evaluate how well Large Language Models (LLMs) comprehend code-switching, using dialogue summarization across multiple language pairs as the probe task. It is the first benchmark for code-switched dialogue summarization spanning Mandarin-English, Tamil-English, and Malay-English, with human-annotated dialogues for each language pair.
Evaluation of ten LLMs shows that, despite high scores on automated metrics, the models make subtle mistakes that can change the meaning of a dialogue.
The study identifies the most common error types LLMs make when processing code-switched input, notes that error rates vary across language pairs, and underscores the need for specialized training on code-switched data.
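A minimal sketch of why high automated scores can coexist with meaning-changing errors: n-gram overlap metrics such as ROUGE reward lexical similarity, not faithfulness. The dialogue summaries below are hypothetical (not from CS-Sum), and the `rouge-score` package is just one common way to compute such metrics.

```python
# Sketch: a summary that reverses who did what still scores highly on ROUGE.
from rouge_score import rouge_scorer

reference = "Mei asked Raj to postpone the meeting; Raj agreed to move it to Friday."
# Hypothetical LLM output that swaps the speakers -- a subtle, meaning-changing error.
hypothesis = "Raj asked Mei to postpone the meeting; Mei agreed to move it to Friday."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, hypothesis)
for name, s in scores.items():
    print(f"{name}: F1 = {s.fmeasure:.2f}")
# ROUGE-1 F1 is near 1.0 because the two summaries share almost all unigrams,
# even though the roles are reversed -- illustrating the gap between
# surface-overlap metrics and actual comprehension of the dialogue.
```

This is why human annotation and error-type analysis, as in CS-Sum, are needed on top of automated metrics.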