Sorting is a challenging task for Large Language Models (LLMs) due to weaknesses in faithfully representing input data, performing logical comparisons, and differentiating between syntax and semantics.
SortBench, a new benchmark for LLM sorting, has been introduced; it offers multiple difficulty levels and scales easily to longer inputs.
Tests conducted on seven state-of-the-art LLMs, including test-time reasoning models, revealed that even highly capable models like o3-mini can struggle with sorting tasks that involve mixing syntax and semantics.
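A minimal Python sketch (an illustration of the concept, not drawn from the benchmark itself) shows why mixing syntax and semantics is hard: strings that denote numbers have one order as character sequences and another as values.

```python
# Strings that look like numbers expose the syntax-vs-semantics split.
items = ["10", "2", "1", "30", "3"]

# Syntactic (lexicographic) order: compares character by character.
syntactic = sorted(items)          # ['1', '10', '2', '3', '30']

# Semantic (numeric) order: compares the values the strings denote.
semantic = sorted(items, key=int)  # ['1', '2', '3', '10', '30']

print(syntactic)
print(semantic)
```

A model must infer from the prompt which of these two orders is intended, and benchmarks that mix both interpretations probe exactly that ability.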
The models also struggle to remain faithful to the input for long lists, often dropping or adding items. Moreover, test-time reasoning models tend to overthink the problem, which degrades their performance.