Test-time scaling paradigms have advanced the capabilities of large language models (LLMs) on complex tasks.
However, theoretical understanding of the sample efficiency of test-time strategies such as self-consistency, best-of-$n$, and self-correction remains limited.
A separation result shows that self-consistency requires substantially more samples than best-of-$n$ to produce the correct answer, with both sample complexities governed by the probability gap between the two most likely answers.
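As an illustration of the two strategies (a simulation sketch, not the paper's analysis), the Python snippet below compares majority voting with best-of-$n$ over a hypothetical answer distribution in which the correct answer is only slightly more likely than the strongest incorrect one; best-of-$n$ is given an idealized verifier that always recognizes the correct answer when it appears among the samples.

```python
import random
from collections import Counter

# Hypothetical answer distribution: the correct answer "A" is only slightly
# more likely than the best wrong answer "B" (probability gap of 0.10).
ANSWERS = ["A", "B", "C"]
PROBS   = [0.40, 0.30, 0.30]
CORRECT = "A"

def sample_answers(n):
    """Draw n i.i.d. answers from the model's answer distribution."""
    return random.choices(ANSWERS, weights=PROBS, k=n)

def self_consistency(samples):
    """Majority vote: return the most frequent sampled answer."""
    return Counter(samples).most_common(1)[0][0]

def best_of_n(samples):
    """Best-of-n with an idealized verifier: succeed if any sample is correct."""
    return CORRECT if CORRECT in samples else samples[0]

def success_rate(strategy, n, trials=5000):
    return sum(strategy(sample_answers(n)) == CORRECT for _ in range(trials)) / trials

for n in (1, 4, 16, 64):
    print(f"n={n:3d}  self-consistency={success_rate(self_consistency, n):.2f}"
          f"  best-of-n={success_rate(best_of_n, n):.2f}")
```

In this toy setting, best-of-$n$ already succeeds with high probability at small $n$, while majority voting needs far more samples before the small probability gap between the top two answers reliably shows up in the empirical counts.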
Beyond sample complexity, self-correction with verifier feedback enables Transformers to simulate online learning over a pool of experts at test time, extending existing representation results for Transformers to the multi-task setting.
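To make the online-learning view concrete, the sketch below runs a multiplicative-weights (Hedge) update over a small expert pool, using 0/1 verifier feedback as the loss. The expert pool, loss signal, and learning rate here are illustrative assumptions, not the paper's construction; the point is only the shape of the loop that the Transformer is claimed to be able to simulate at test time.

```python
import math
import random

# Hypothetical pool of "experts": fixed answering strategies for a task.
# In the representation-theoretic setting these would be behaviors the
# Transformer can express; here they are simple stand-ins.
def expert_always_a(_): return "A"
def expert_always_b(_): return "B"
def expert_guess(_):    return random.choice(["A", "B"])

EXPERTS = [expert_always_a, expert_always_b, expert_guess]

def hedge_self_correction(rounds, correct_answer="A", lr=0.5):
    """Multiplicative-weights (Hedge) over the expert pool.

    Each round, every expert proposes an answer, a verifier assigns it a
    0/1 loss, and the weights of wrong experts are downscaled
    exponentially, so mass concentrates on reliable experts over time.
    """
    weights = [1.0] * len(EXPERTS)
    for t in range(rounds):
        # Verifier feedback: loss 1 if the expert's answer is wrong, else 0.
        losses = [0.0 if expert(t) == correct_answer else 1.0 for expert in EXPERTS]
        weights = [w * math.exp(-lr * loss) for w, loss in zip(weights, losses)]
    total = sum(weights)
    return [w / total for w in weights]

print(hedge_self_correction(rounds=20))  # weight concentrates on the correct expert
```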