A new approach, the Energy Outcome Reward Model (EORM), is introduced to improve the reliability of reasoning elicited from large language models (LLMs).
EORM applies the Energy-Based Model (EBM) framework to train a reward model that assigns energy scores to Chain-of-Thought (CoT) solutions using only final outcome labels, without requiring detailed step-level annotations.
By interpreting the discriminator's output logits as negative energies, EORM ranks candidates so that solutions leading to correct final outcomes are assigned lower energy, favoring coherent reasoning.
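The following is a minimal sketch of this energy-from-logit view, assuming the reward model is a binary verifier that maps a (question, CoT solution) pair to a single logit for "leads to a correct outcome" (the `verifier` callable, its signature, and the helper names are illustrative assumptions, not the paper's implementation):

```python
import torch

def energy_scores(verifier, question: str, candidates: list[str]) -> torch.Tensor:
    """One energy per candidate: energy = -logit, so lower is better."""
    # Assumes `verifier(question, cot)` returns a scalar tensor logit,
    # where a higher logit indicates a more likely correct outcome.
    logits = torch.stack([verifier(question, c) for c in candidates])  # shape (n,)
    return -logits  # interpret the discriminator logit as a negative energy

def rank_by_energy(verifier, question: str, candidates: list[str]) -> list[str]:
    """Order candidate CoT solutions from lowest to highest energy."""
    energies = energy_scores(verifier, question, candidates)
    order = torch.argsort(energies)  # ascending: lowest-energy (most plausible) first
    return [candidates[i] for i in order.tolist()]
```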
On mathematical benchmarks such as GSM8k and MATH, EORM improves final-answer accuracy and the reliability of reasoning outcomes by effectively leveraging a pool of candidate solutions, outperforming brute-force sampling.
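As a hedged usage sketch of this candidate-pool reranking, building on the ranking helper above: `sample_cot` and `final_answer` are hypothetical stand-ins for the base LLM's sampler and an answer-extraction routine, and the pool size is an arbitrary example value.

```python
def best_of_n(verifier, sample_cot, final_answer, question: str, n: int = 16) -> str:
    """Sample a pool of CoT solutions and return the answer of the lowest-energy one."""
    candidates = [sample_cot(question) for _ in range(n)]      # candidate CoT pool
    best = rank_by_energy(verifier, question, candidates)[0]   # lowest energy wins
    return final_answer(best)                                  # extract its final answer
```

Selecting the lowest-energy candidate in this way reuses the same sampled pool that brute-force strategies draw from, differing only in how the final solution is chosen.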