Entropy minimization (EM) without labeled data can significantly enhance the performance of large language models (LLMs) on math, physics, and coding tasks.
Three approaches were explored: EM-FT minimizes token-level entropy directly on unlabeled outputs sampled from the model, analogous to instruction fine-tuning; EM-RL uses reinforcement learning with negative entropy as the only reward; EM-INF adjusts logits at inference time to reduce entropy, requiring no training data and no parameter updates. A minimal sketch of these ideas follows.
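The PyTorch sketch below illustrates the entropy quantities involved, under stated assumptions: the function names are hypothetical, and the temperature-style scaling used for the inference-time variant is an illustrative mechanism rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the next-token distribution at each position.

    logits: (..., vocab_size) -> entropy: (...)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)


def em_ft_loss(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """EM-FT-style objective: mean token-level entropy over generated
    (unlabeled) tokens. Backpropagating this loss updates the model the
    way instruction fine-tuning does, but with entropy as the loss."""
    entropy = token_entropy(logits)  # (batch, seq_len)
    return (entropy * mask).sum() / mask.sum().clamp(min=1)


def em_inf_sharpen_logits(logits: torch.Tensor,
                          target_entropy: float = 0.5,
                          steps: int = 10,
                          lr: float = 0.1) -> torch.Tensor:
    """EM-INF-style inference-time adjustment: sharpen the current decoding
    step's logits (no parameter updates) until their entropy falls toward a
    target. Optimizing a temperature-like scale factor is an assumed
    mechanism here, used only to illustrate test-time entropy reduction."""
    logits = logits.detach()
    scale = torch.ones(logits.shape[:-1], device=logits.device,
                       requires_grad=True)
    optimizer = torch.optim.SGD([scale], lr=lr)
    for _ in range(steps):
        entropy = token_entropy(logits * scale.unsqueeze(-1)).mean()
        loss = (entropy - target_entropy).clamp(min=0.0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return logits * scale.detach().unsqueeze(-1)
```

In this sketch, em_ft_loss would be minimized over the model's own sampled completions, while em_inf_sharpen_logits would be applied to each decoding step's logits before sampling, leaving the model weights untouched.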
EM-RL achieved performance comparable to strong RL baselines such as GRPO and RLOO on Qwen-7B without any labeled data, while EM-INF enabled Qwen-32B to exceed models such as GPT-4o and Gemini 1.5 Pro on the SciCode benchmark.
Pretrained LLMs can thus exhibit enhanced reasoning through entropy minimization alone, showing that performance can improve without labeled data and, in the inference-time setting, even without parameter updates.