Pass@K Policy Optimisation: Solving Harder Reinforcement Learning Problems
Christian Walder, Deep Karkhanis
With the rise of increasingly capable generative models, evaluating and improving them via reinforcement learning has become central to progress. Pass@K Policy Optimisation (PKPO) offers an elegant reframing of how sample efficiency and exploration should be handled.
While most RL pipelines optimise for pass@1, treating each sample independently, the authors highlight how this short-sighted focus undervalues the collective utility of a batch – an issue that becomes especially limiting on harder tasks where single-shot success is rare.
PKPO tackles this by directly optimising pass@k through a family of low-variance, unbiased estimators for both binary and continuous rewards. These transformations jointly reshape batches of rewards in a way that is computationally stable and compatible with standard RL training loops. Importantly, unlike prior work that only considers the k = n case, the method generalises to any k <= n, enabling finer control over the exploration-exploitation trade-off.
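For concreteness, here is a minimal sketch of the batch-level quantity being optimised, using the standard combinatorial pass@k estimator for binary rewards; the function name is my own, and the low-variance per-sample reward transformations that PKPO derives on top of this estimator are not reproduced here:

```python
from math import comb


def pass_at_k_estimate(rewards, k):
    """Unbiased pass@k estimate from n >= k binary (0/1) rewards.

    Standard combinatorial estimator: 1 - C(n - c, k) / C(n, k),
    where c is the number of successful samples among the n drawn.
    """
    n = len(rewards)
    c = sum(rewards)
    if n - c < k:  # every size-k subset of the batch contains a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: a batch of n = 8 samples with 2 successes.
rewards = [0, 1, 0, 0, 0, 1, 0, 0]
print(pass_at_k_estimate(rewards, k=1))  # 0.25, equal to the mean reward
print(pass_at_k_estimate(rewards, k=4))  # ~0.79, the batch is far more valuable jointly
```

The gap between the two printed values is exactly the "collective utility" point above: a batch that looks mediocre under pass@1 can still be highly useful when k samples are allowed.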
The results show clear practical benefits: higher k values unlock the ability to solve more challenging problems, while annealing k during training preserves strong pass@1 performance alongside substantial pass@k gains. By prioritising joint utility over isolated successes, PKPO offers a route to improving exploration and unblocking stalled learning. I’m excited to see how this approach shapes future RL fine-tuning pipelines across large model training.
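On the annealing point, one simple way to realise it, purely as an illustration and not the schedule from the paper, is to decay k linearly from the full sample budget n towards 1 over training and feed the current k into the reward transformation at each update:

```python
def annealed_k(step, total_steps, n, k_final=1):
    """Hypothetical linear schedule: start at k = n, end at k = k_final.

    The paper reports gains from annealing k during training; the exact
    schedule it uses is not reproduced here, this is just one simple choice.
    """
    frac = min(step / max(total_steps, 1), 1.0)
    k = round(n - frac * (n - k_final))
    return max(k_final, min(n, k))


# e.g. with n = 8 samples per prompt over 10,000 updates:
# starts at k = 8, drops to roughly 4 midway, and finishes at k = 1.
print(annealed_k(0, 10_000, 8), annealed_k(5_000, 10_000, 8), annealed_k(10_000, 10_000, 8))
```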