Image Credit: Arxiv

Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision

  • A new approach called the Energy Outcome Reward Model (EORM) is introduced to improve the reliability of reasoning steps elicited from large language models (LLMs).
  • EORM trains a reward model framed as an Energy-Based Model (EBM) to assign energy scores to Chain-of-Thought solutions using only final-outcome labels, without requiring detailed step-level annotations.
  • By interpreting the discriminator's output logits as negative energies, EORM ranks candidates so that solutions leading to correct final outcomes receive lower energy scores, promoting coherent reasoning.
  • On mathematical benchmarks such as GSM8k and MATH, EORM improves final-answer accuracy and reasoning reliability by reranking a pool of candidate solutions, outperforming brute-force sampling.
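The reranking step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_energy` is a hypothetical stand-in for the trained EORM discriminator, whose negative output logit would serve as the energy in the actual method.

```python
from typing import Callable, List, Tuple

def eorm_rerank(candidates: List[str],
                energy_fn: Callable[[str], float]) -> Tuple[str, float]:
    """Score each Chain-of-Thought candidate and return the one with the
    lowest energy, i.e. the one ranked most likely to be correct."""
    scored = [(c, energy_fn(c)) for c in candidates]
    return min(scored, key=lambda pair: pair[1])

def toy_energy(solution: str) -> float:
    # Hypothetical stand-in for the learned discriminator: in EORM the
    # energy would be the negative logit of a trained reward model.
    return 0.0 if solution.strip().endswith("42") else 1.0

candidates = [
    "Step 1: 6 * 7 = 41. Answer: 41",
    "Step 1: 6 * 7 = 42. Answer: 42",
]
best, energy = eorm_rerank(candidates, toy_energy)
```

Because only the ranking matters, any monotone transform of the discriminator's scores yields the same selected candidate.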

Read Full Article
