Pass-at-k Policy Optimization (PKPO) is proposed to address a limitation of reinforcement learning algorithms that optimize only pass@1 performance.
PKPO transforms the final rewards so that optimizing them optimizes pass@k rather than pass@1 performance, prioritizing sets of samples that maximize reward when considered jointly.
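As an illustration of the transformation idea, the sketch below maps a group of n sampled rewards to per-sample transformed rewards whose mean is an unbiased estimate of the expected best-of-k reward (which reduces to pass@k for binary rewards). The order-statistic weighting is one standard unbiased construction, assumed here for illustration; it is not necessarily the paper's exact low-variance transform, and the function name is hypothetical.

```python
from math import comb

import numpy as np

def pass_at_k_transform(rewards: np.ndarray, k: int) -> np.ndarray:
    """Transform n per-sample rewards so their mean estimates best-of-k reward.

    Sorts rewards ascending and weights the i-th order statistic by
    C(i-1, k-1) / C(n, k), the probability that it is the maximum of a
    uniformly random size-k subset of the n samples. Scaling by n makes
    the mean of the transformed rewards equal the unbiased estimate.
    """
    n = len(rewards)
    assert 1 <= k <= n, "requires 1 <= k <= n"
    order = np.argsort(rewards)  # indices of rewards in ascending order
    transformed = np.zeros(n)
    for rank, idx in enumerate(order, start=1):  # rank runs 1..n
        weight = comb(rank - 1, k - 1) / comb(n, k)  # comb returns 0 if rank < k
        transformed[idx] = n * weight * rewards[idx]
    return transformed
```

In a policy-gradient loop, transformed rewards like these would simply replace the raw per-sample rewards before advantage computation, leaving the rest of the algorithm unchanged.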
Novel low-variance, unbiased estimators are derived for pass@k and its gradient, in both the binary and continuous reward settings.
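For the binary setting, the standard unbiased estimator of pass@k from n samples has a compact combinatorial form; a minimal sketch (the function name and interface are illustrative, not the paper's API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n binary-reward samples with c successes.

    pass@k = 1 - C(n - c, k) / C(n, k): one minus the probability that a
    uniformly random size-k subset of the n samples contains no success.
    """
    assert 1 <= k <= n, "requires 1 <= k <= n"
    if n - c < k:
        return 1.0  # fewer than k failures, so every k-subset has a success
    return 1.0 - comb(n - c, k) / comb(n, k)
```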
PKPO enables robust optimization of pass@k for arbitrary k <= n (where n is the number of samples drawn per prompt), and allows k to be annealed during training to improve both the pass@1 and pass@k metrics; see the sketch below.
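Since k may take any value up to n, it can be scheduled over the course of training. A minimal sketch of one possible annealing schedule, assuming a linear decay from a large initial k down to 1 (the shape and direction are assumptions for illustration, not the paper's prescription):

```python
def annealed_k(step: int, total_steps: int, k_start: int) -> int:
    """Linearly anneal k from k_start down to 1 over training.

    Starting at a larger k rewards diverse, jointly useful samples;
    finishing at k = 1 recovers the pass@1 objective.
    """
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)  # progress in [0, 1]
    return max(1, round(k_start * (1.0 - frac)))
```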