Researchers propose a way to optimize chain-of-thought reasoning with reinforcement learning without relying on an external reward function. They use a simpler lower bound, derived from Jensen's inequality, to obtain objectives that remain tractable for large-scale training. The approach can be applied in settings such as supervised fine-tuning and online reinforcement learning. The results suggest that this algorithmic paradigm has potential for general-purpose applications.
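
To make the role of Jensen's inequality concrete, here is a minimal sketch of the kind of bound involved. The notation is an assumption introduced for illustration and does not come from the summary above: x is the prompt, z is a sampled chain-of-thought, y is a reference answer, and \pi_\theta is the model. The objective the researchers actually train on may differ in its details.

```latex
% Assumed notation (hypothetical, for illustration only):
%   x = prompt, z = chain-of-thought, y = reference answer,
%   \pi_\theta = the language model being trained.
\begin{align}
\log \pi_\theta(y \mid x)
  &= \log \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
     \big[ \pi_\theta(y \mid x, z) \big] \\
  % Jensen's inequality: log is concave, so log E[.] >= E[log .]
  &\ge \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
     \big[ \log \pi_\theta(y \mid x, z) \big]
\end{align}
```

Under these assumptions, the right-hand side can be estimated by sampling chains of thought from the model and scoring the reference answer under the model itself, which is one way such an objective could avoid an external reward function.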