Sorting is a challenging task for Large Language Models (LLMs) due to weaknesses in faithfully representing input data, performing logical comparisons, and differentiating between syntax and semantics.
SortBench, a new benchmark for LLM sorting, has been introduced; it offers multiple difficulty levels and scales easily to longer inputs.
Tests conducted on seven state-of-the-art LLMs, including test-time reasoning models, revealed that even highly capable models like o3-mini can struggle with sorting tasks that involve mixing syntax and semantics.
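A minimal Python sketch (an illustration of the concept, not drawn from the benchmark itself) shows why mixing syntax and semantics is hard: strings that denote numbers have one order as character sequences and another as values.

```python
# Strings that look like numbers expose the syntax-vs-semantics split.
items = ["10", "2", "1", "30", "3"]

# Syntactic (lexicographic) order: compares character by character.
syntactic = sorted(items)          # ['1', '10', '2', '3', '30']

# Semantic (numeric) order: compares the values the strings denote.
semantic = sorted(items, key=int)  # ['1', '2', '3', '10', '30']

print(syntactic)
print(semantic)
```

A model must infer from the prompt which of these two orders is intended, and benchmarks that mix both interpretations probe exactly that ability.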
The models also struggle to remain faithful to the input for long lists, often dropping or adding items. Moreover, test-time reasoning models tend to overthink the problem, which degrades their performance.