Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, Akash Srivastava
This paper introduces Rollout Roulette, a principled approach to inference-time scaling for large language models that reframes generation as a problem of probabilistic inference rather than deterministic search.
The method applies in settings where a process reward model, which assigns scores to partial continuation sequences, is available. In this setting, commonly used decoding methods are inspired by beam search or Monte Carlo Tree Search and use the given reward model in a relatively greedy way. As a result, those methods can prematurely discard promising continuations, especially when the reward model makes early scoring errors or its scores are noisy.
The authors propose instead to model the generation process as a state-space system in which the language model defines a transition distribution over tokens, while a process reward model provides noisy, imperfect observations of solution quality along partial trajectories.
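One standard way to write this state-space formulation down (the notation here is mine, not necessarily the paper's) is to treat the language model as the transition distribution and the process reward as a noisy observation of partial-trajectory quality:

```latex
% Transition: the LM's next-token distribution given prompt c and prefix x_{1:t-1}
p(x_t \mid c,\, x_{1:t-1})

% Observation: the PRM score r(c, x_{1:t}) induces a noisy "quality" observation o_t,
% e.g. p(o_t = 1 \mid c, x_{1:t}) \propto \exp\!\big(r(c,\, x_{1:t})\big)

% Target: the posterior over full trajectories conditioned on high quality throughout
p(x_{1:T} \mid c,\, o_{1:T} = 1) \;\propto\; \prod_{t=1}^{T} p(x_t \mid c,\, x_{1:t-1})\; p(o_t = 1 \mid c,\, x_{1:t})
```

Sampling from this posterior, rather than maximizing the reward greedily, is what makes particle-based filtering applicable.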
Test-time scaling can then be performed by applying classical particle-based Monte Carlo filtering methods to sample from the posterior over output-token trajectories, given the input tokens and the observed partial rewards. In this setting, each particle represents a candidate continuation trajectory, weighted according to the reward model and stochastically resampled.
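The propagate/weight/resample loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_step` and `reward` are hypothetical stand-ins for the language model's next-step sampler and the process reward model (PRM).

```python
import random

random.seed(0)

def sample_step(prefix):
    # Hypothetical stand-in for the LM transition: append a random token id.
    return prefix + [random.randint(0, 9)]

def reward(prefix):
    # Hypothetical stand-in for a PRM score in (0, 1): prefers even token ids.
    evens = sum(1 for t in prefix if t % 2 == 0)
    return (evens + 1) / (len(prefix) + 2)

def particle_filter(n_particles=8, n_steps=5):
    particles = [[] for _ in range(n_particles)]
    for _ in range(n_steps):
        # 1. Propagate: extend each partial trajectory with the LM.
        particles = [sample_step(p) for p in particles]
        # 2. Weight: score each partial trajectory with the PRM.
        weights = [reward(p) for p in particles]
        # 3. Resample: draw particles in proportion to their weights, so
        #    promising continuations are duplicated and weak ones fade out
        #    stochastically rather than being pruned greedily.
        particles = random.choices(particles, weights=weights, k=n_particles)
    # Return the highest-scoring trajectory in the final particle set.
    return max(particles, key=reward)

best = particle_filter()
```

Because resampling is stochastic, a low-scoring particle can survive a step and recover later, which is exactly the hedge against noisy early rewards that greedy pruning lacks.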
Because candidate continuations are discarded only when the weights across the set of candidate trajectories become sufficiently imbalanced, rather than greedily at each step, this probabilistic treatment allows the method to hedge against reward model uncertainty, preserve alternative reasoning paths, and defer hard decisions until sufficient evidence has accumulated.
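The review does not specify the paper's exact imbalance criterion; a common choice in particle filtering is the effective sample size (ESS), which equals the particle count for uniform weights and approaches 1 when a single particle dominates. A typical rule resamples only when ESS falls below half the particle count:

```python
def effective_sample_size(weights):
    # ESS = (sum w)^2 / (sum w^2): N for uniform weights, ~1 when one dominates.
    s = sum(weights)
    return s * s / sum(w * w for w in weights)

n = 4
uniform = [0.25] * n               # balanced weights: ESS == 4, no resampling
skewed = [0.9, 0.05, 0.03, 0.02]   # one particle dominates: ESS near 1

# Illustrative threshold (an assumption, not necessarily the paper's choice).
should_resample = effective_sample_size(skewed) < n / 2
```

Deferring resampling this way is what turns hard per-step pruning into a soft, evidence-accumulating decision.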
Empirically, the authors demonstrate that this particle filtering framework leads to substantially more efficient inference-time scaling on challenging reasoning tasks, achieving strong performance improvements and, in some settings, allowing smaller models to rival or surpass much larger models under fixed rollout budgets.
I found this paper really satisfying because of how it leverages an elegant classical method in a natural way in the context of LLMs to deliver strong performance improvements.
Rollout Roulette: A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods