Leveraging more test-time computation can enhance the reasoning capabilities of large language models (LLMs).
The verify-and-improve paradigm lets LLMs dynamically explore candidate solutions and incorporate feedback to revise them.
This work introduces DPSDP, a new reinforcement learning algorithm that improves LLM performance by training an actor-critic system of LLMs to iteratively refine answers.
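To make the refinement loop concrete, here is a minimal sketch of an iterative actor-critic answer-refinement procedure. The helper `query_llm`, the role names, the prompt wording, and the CORRECT-based stopping rule are all illustrative assumptions for this sketch, not the paper's exact protocol or API:

```python
# A minimal sketch of the verify-and-improve loop that an actor-critic
# system performs at inference time. The LLM call is stubbed so the
# example runs standalone; swap in real actor/critic models in practice.

def query_llm(role: str, prompt: str) -> str:
    """Stub for an LLM call (hypothetical helper, not the paper's API)."""
    if role == "critic":
        return "CORRECT"  # A trained critic would return real feedback.
    return "x = 4"        # A trained actor would return a real answer.

def refine_answer(question: str, max_rounds: int = 3) -> str:
    # Actor proposes an initial answer.
    answer = query_llm("actor", f"Question: {question}\nAnswer step by step.")
    for _ in range(max_rounds):
        # Critic verifies the current answer and produces feedback.
        feedback = query_llm(
            "critic",
            f"Question: {question}\nProposed answer: {answer}\n"
            "Identify any errors, or reply CORRECT.",
        )
        if feedback.strip() == "CORRECT":
            break  # Critic accepts the answer; stop refining.
        # Actor revises its answer conditioned on the critic's feedback.
        answer = query_llm(
            "actor",
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Feedback: {feedback}\nRevise the answer.",
        )
    return answer

if __name__ == "__main__":
    print(refine_answer("Solve 2x + 1 = 9 for x."))
```

DPSDP's contribution is training the actor and critic for this loop with reinforcement learning; the loop itself above only illustrates how the two roles interact at test time.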
Empirical results show that applying DPSDP to various base models yields improvements on both in- and out-of-distribution benchmarks, demonstrating the benefits of multi-agent collaboration.