Reinforcement learning with verifiable rewards (RLVR) has proven successful at improving the math and coding abilities of large language models.
Efforts are underway to extend RLVR to real-world domains such as forecasting, where the reward signal arrives only as a verifiable outcome, making outcome-based reinforcement learning crucial.
New adaptations of two algorithms, Group-Relative Policy Optimisation (GRPO) and ReMax, have shown promising gains in forecasting accuracy and calibration.
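As a rough illustration rather than the authors' implementation, the core idea behind GRPO's credit assignment can be sketched as follows: several forecasts are sampled per question, each is scored by a verifiable outcome-based reward, and each sample's advantage is its reward relative to its own group's statistics, so no learned value baseline is needed. The negative Brier-score reward and the sample probabilities here are hypothetical placeholders.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: score each sampled completion relative to
    the mean (and std) of its own group of samples for the same prompt."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical example: 4 sampled probability forecasts for one question,
# rewarded by negative Brier score against the realized outcome y = 1.
probs = np.array([0.9, 0.6, 0.3, 0.8])   # sampled predicted probabilities
outcome = 1.0
rewards = -(probs - outcome) ** 2         # negative Brier score
print(group_relative_advantages(rewards))
```

ReMax follows the same outcome-reward setup but replaces the group-mean baseline with the reward of a single greedy-decoded response, trading the cost of multiple samples for one extra deterministic rollout.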
Refined RLVR methods could transform small-scale language models into economically valuable forecasting tools, with implications for scaling to larger models.