Reinforcement learning has been used to enhance the reasoning capabilities of Large Language Models (LLMs), but existing approaches offer little guidance for exploration and provide only sparse, outcome-level feedback.
A new method, Intrinsic Motivation guidEd exploratioN meThOd foR LLM Reasoning (i-MENTOR), is proposed to address these challenges by delivering dense rewards and amplifying exploration within the RL-based training paradigm.
i-MENTOR introduces three components: trajectory-aware exploration rewards, dynamic reward scaling, and an advantage-preserving reward implementation, which together improve performance on complex reasoning tasks.
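To make the three components concrete, here is a minimal Python sketch of how they could fit together; this is not the paper's actual implementation. The count-based n-gram novelty measure, the linear decay schedule, and the clipping rule for preserving advantage ordering are all assumptions made for illustration, as are the names `exploration_reward`, `dynamic_scale`, and `shaped_rewards`.

```python
# Illustrative sketch only -- the novelty proxy, decay schedule, and clipping
# rule below are assumed, not taken from the i-MENTOR paper.
from collections import Counter
from typing import List

ngram_counts: Counter = Counter()  # global visit counts over trajectory n-grams


def exploration_reward(trajectory: List[int], n: int = 4) -> float:
    """Trajectory-aware exploration reward: higher for trajectories whose
    token n-grams have been seen less often (count-based novelty proxy)."""
    grams = [tuple(trajectory[i:i + n]) for i in range(len(trajectory) - n + 1)]
    if not grams:
        return 0.0
    novelty = sum(1.0 / (1 + ngram_counts[g]) for g in grams) / len(grams)
    ngram_counts.update(grams)  # repeated trajectories earn smaller bonuses
    return novelty


def dynamic_scale(step: int, total_steps: int, beta0: float = 0.1) -> float:
    """Dynamic reward scaling: decay the exploration bonus over training so
    late-stage optimization is dominated by the task (outcome) reward."""
    return beta0 * max(0.0, 1.0 - step / total_steps)


def shaped_rewards(outcome: List[float], bonuses: List[float], beta: float) -> List[float]:
    """Advantage-preserving combination (one possible reading): cap each
    scaled bonus below the smallest gap between distinct outcome rewards,
    so the ranking of trajectories by outcome reward is never flipped."""
    gaps = sorted({abs(a - b) for a in outcome for b in outcome if a != b})
    cap = 0.5 * gaps[0] if gaps else float("inf")
    return [r + min(beta * b, cap) for r, b in zip(outcome, bonuses)]


# Usage: shape a group of sampled trajectories before advantage estimation.
group = [[1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6], [9, 8, 7, 6, 5, 4]]
outcome = [1.0, 1.0, 0.0]  # sparse correctness rewards from a verifier
bonuses = [exploration_reward(t) for t in group]
print(shaped_rewards(outcome, bonuses, dynamic_scale(step=100, total_steps=1000)))
```

In this sketch the second, duplicated trajectory receives a smaller bonus than the first, and the novel third trajectory the largest, which mirrors the intended effect of rewarding exploration without letting the bonus override the outcome signal.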
Experiments show that i-MENTOR achieves a 22.39% improvement on the difficult Countdown-4 dataset, demonstrating its effectiveness in enhancing LLM reasoning.