Entropy minimization (EM) without labeled data can significantly enhance the performance of large language models (LLMs) on math, physics, and coding tasks.
Three approaches were explored: EM-FT minimizes token-level entropy directly on unlabeled outputs sampled from the model, analogous to instruction fine-tuning; EM-RL uses reinforcement learning with negative entropy as the only reward; EM-INF adjusts logits at inference time to reduce entropy, requiring no training data and no parameter updates. A minimal sketch of these ideas follows.
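The PyTorch sketch below illustrates the entropy quantities involved, under stated assumptions: the function names are hypothetical, and the temperature-style scaling used for the inference-time variant is an illustrative mechanism rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the next-token distribution at each position.

    logits: (..., vocab_size) -> entropy: (...)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)


def em_ft_loss(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """EM-FT-style objective: mean token-level entropy over generated
    (unlabeled) tokens. Backpropagating this loss updates the model the
    way instruction fine-tuning does, but with entropy as the loss."""
    entropy = token_entropy(logits)  # (batch, seq_len)
    return (entropy * mask).sum() / mask.sum().clamp(min=1)


def em_inf_sharpen_logits(logits: torch.Tensor,
                          target_entropy: float = 0.5,
                          steps: int = 10,
                          lr: float = 0.1) -> torch.Tensor:
    """EM-INF-style inference-time adjustment: sharpen the current decoding
    step's logits (no parameter updates) until their entropy falls toward a
    target. Optimizing a temperature-like scale factor is an assumed
    mechanism here, used only to illustrate test-time entropy reduction."""
    logits = logits.detach()
    scale = torch.ones(logits.shape[:-1], device=logits.device,
                       requires_grad=True)
    optimizer = torch.optim.SGD([scale], lr=lr)
    for _ in range(steps):
        entropy = token_entropy(logits * scale.unsqueeze(-1)).mean()
        loss = (entropy - target_entropy).clamp(min=0.0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return logits * scale.detach().unsqueeze(-1)
```

In this sketch, em_ft_loss would be minimized over the model's own sampled completions, while em_inf_sharpen_logits would be applied to each decoding step's logits before sampling, leaving the model weights untouched.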
EM-RL achieved performance comparable to strong RL baselines such as GRPO and RLOO on Qwen-7B without any labeled data, while EM-INF enabled Qwen-32B to exceed models such as GPT-4o and Gemini 1.5 Pro on the SciCode benchmark.
Pretrained LLMs can thus exhibit enhanced reasoning through entropy minimization alone, showing that performance can improve without labeled data and, in the inference-time setting, even without parameter updates.