<ul><li>Researchers propose a way to optimize chain-of-thought with reinforcement learning that does not require an external reward function.</li><li>They use a simple lower bound based on Jensen's inequality to derive objectives that are tractable for large-scale training.</li><li>The approach can be applied to various methods such as supervised fine-tuning and online reinforcement learning.</li><li>The results demonstrate the potential of this new algorithmic paradigm for general-purpose applications.</li></ul>
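
The bound mentioned in the summary can be sketched as follows. This is an illustrative reading rather than the authors' exact formulation: it assumes the chain-of-thought is treated as a latent variable z sampled from the model given a prompt x, with y the reference answer. The symbols x, y, z, and the parameters θ are notation introduced here, not taken from the summary.

```latex
% Assumption: z is a chain-of-thought sampled from the model \pi_\theta(\cdot \mid x),
% and y is the reference answer for prompt x.
\begin{align}
\log p_\theta(y \mid x)
  &= \log \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
     \bigl[ p_\theta(y \mid x, z) \bigr] \\
  &\ge \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
     \bigl[ \log p_\theta(y \mid x, z) \bigr]
\end{align}
```

The inequality is Jensen's inequality applied to the concave logarithm. Under this reading, the right-hand side can be estimated from sampled chains of thought, and the log-likelihood of the reference answer plays the role that an external reward would otherwise play.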

Learning to chain-of-thought with Jensen's evidence lower bound
