Test-time scaling paradigms have advanced the capabilities of large language models (LLMs) on complex tasks.
However, theoretical understanding of the sample efficiency of test-time strategies such as self-consistency, best-of-$n$, and self-correction remains limited.
A separation result shows that self-consistency requires substantially more samples than best-of-$n$ to produce the correct answer, with both sample complexities governed by the probability gap between the two most likely answers.
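As an illustration of the two strategies (a simulation sketch, not the paper's analysis), the Python snippet below compares majority voting with best-of-$n$ over a hypothetical answer distribution in which the correct answer is only slightly more likely than the strongest incorrect one; best-of-$n$ is given an idealized verifier that always recognizes the correct answer when it appears among the samples.

```python
import random
from collections import Counter

# Hypothetical answer distribution: the correct answer "A" is only slightly
# more likely than the best wrong answer "B" (probability gap of 0.10).
ANSWERS = ["A", "B", "C"]
PROBS   = [0.40, 0.30, 0.30]
CORRECT = "A"

def sample_answers(n):
    """Draw n i.i.d. answers from the model's answer distribution."""
    return random.choices(ANSWERS, weights=PROBS, k=n)

def self_consistency(samples):
    """Majority vote: return the most frequent sampled answer."""
    return Counter(samples).most_common(1)[0][0]

def best_of_n(samples):
    """Best-of-n with an idealized verifier: succeed if any sample is correct."""
    return CORRECT if CORRECT in samples else samples[0]

def success_rate(strategy, n, trials=5000):
    return sum(strategy(sample_answers(n)) == CORRECT for _ in range(trials)) / trials

for n in (1, 4, 16, 64):
    print(f"n={n:3d}  self-consistency={success_rate(self_consistency, n):.2f}"
          f"  best-of-n={success_rate(best_of_n, n):.2f}")
```

In this toy setting, best-of-$n$ already succeeds with high probability at small $n$, while majority voting needs far more samples before the small probability gap between the top two answers reliably shows up in the empirical counts.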
Beyond sample complexity, self-correction with verifier feedback enables Transformers to simulate online learning over a pool of experts at test time, extending existing representation results for Transformers to the multi-task setting.
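To make the online-learning view concrete, the sketch below runs a multiplicative-weights (Hedge) update over a small expert pool, using 0/1 verifier feedback as the loss. The expert pool, loss signal, and learning rate here are illustrative assumptions, not the paper's construction; the point is only the shape of the loop that the Transformer is claimed to be able to simulate at test time.

```python
import math
import random

# Hypothetical pool of "experts": fixed answering strategies for a task.
# In the representation-theoretic setting these would be behaviors the
# Transformer can express; here they are simple stand-ins.
def expert_always_a(_): return "A"
def expert_always_b(_): return "B"
def expert_guess(_):    return random.choice(["A", "B"])

EXPERTS = [expert_always_a, expert_always_b, expert_guess]

def hedge_self_correction(rounds, correct_answer="A", lr=0.5):
    """Multiplicative-weights (Hedge) over the expert pool.

    Each round, every expert proposes an answer, a verifier assigns it a
    0/1 loss, and the weights of wrong experts are downscaled
    exponentially, so mass concentrates on reliable experts over time.
    """
    weights = [1.0] * len(EXPERTS)
    for t in range(rounds):
        # Verifier feedback: loss 1 if the expert's answer is wrong, else 0.
        losses = [0.0 if expert(t) == correct_answer else 1.0 for expert in EXPERTS]
        weights = [w * math.exp(-lr * loss) for w, loss in zip(weights, losses)]
    total = sum(weights)
    return [w / total for w in weights]

print(hedge_self_correction(rounds=20))  # weight concentrates on the correct expert
```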