Researchers propose a novel inference-aware fine-tuning paradigm for large language models (LLMs).
Rather than optimizing the model in isolation, the paradigm optimizes the performance of the inference-time strategy that will be applied on top of it at deployment.
Imitation learning and reinforcement learning methods are devised to tackle the non-differentiable argmax operator within the Best-of-N (BoN) inference strategy.
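To illustrate the inference-time strategy in question, here is a minimal sketch of Best-of-N sampling: draw N candidate responses and keep the one a scoring function ranks highest. The selection step is the non-differentiable argmax the paper's methods work around. The `best_of_n` helper and the toy generator and reward functions are hypothetical stand-ins, not the paper's implementation; in practice the generator would be an LLM policy and the reward a verifier or reward model.

```python
import random

def best_of_n(generate, reward, prompt, n=4, seed=0):
    """Best-of-N: sample n candidates, return the highest-scoring one.

    The max over candidates is the non-differentiable argmax step
    that BoN-aware fine-tuning must handle during training.
    """
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=reward)

# Toy stand-ins (hypothetical) for a policy LLM and a verifier.
def toy_generate(prompt, rng):
    return f"{prompt} -> answer {rng.randint(0, 9)}"

def toy_reward(response):
    # A verifier would score correctness; here we just prefer
    # larger trailing numbers for demonstration.
    return int(response.split()[-1])

best = best_of_n(toy_generate, toy_reward, "2+2", n=8)
```

Because the argmax selects only the single best sample, gradients cannot flow through the selection, which motivates the imitation-learning and reinforcement-learning treatments described above.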
Experiments show that BoN-aware fine-tuning improves both task performance and the effectiveness with which inference-time compute is used.