Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taiga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, Aviral Kumar, Rishabh Agarwal
Scaling up models has been less straightforward in reinforcement learning than in supervised learning, where it has been overwhelmingly successful. Previous work has pointed out several cases where increasing model capacity or the number of training iterations eventually decreases performance.
This paper investigates replacing the usual MSE loss on the scalar value function with a cross-entropy loss, after quantizing the value into a number of bins. A previous approach distributed the probability mass across the two closest bins so that the expectation matches the original value ("Two-Hot encoding"). The authors propose smearing with a Gaussian and initially attempt to adapt its width to the bin size (essentially a form of Six-Hot encoding), but discover that the absolute variance of the Gaussian, rather than the number of bins in its effective support, is the relevant hyperparameter.
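The two target constructions can be sketched as follows (a minimal NumPy sketch; the function names and bin layout are illustrative, not taken from the paper):

```python
import math
import numpy as np

def two_hot_targets(values, bin_centers):
    """Two-Hot encoding: split each value's probability mass between the two
    nearest bin centers so the target distribution's expectation equals it."""
    values = np.clip(values, bin_centers[0], bin_centers[-1])
    targets = np.zeros((len(values), len(bin_centers)))
    for i, v in enumerate(values):
        upper = np.searchsorted(bin_centers, v)  # first center >= v
        if upper == 0:
            targets[i, 0] = 1.0
            continue
        lower = upper - 1
        w = (v - bin_centers[lower]) / (bin_centers[upper] - bin_centers[lower])
        targets[i, lower], targets[i, upper] = 1.0 - w, w
    return targets

def gauss_smeared_targets(values, bin_edges, sigma):
    """Gaussian smearing: place a Gaussian with *fixed* std `sigma` at each
    value and integrate its density over every bin via the CDF."""
    phi = np.vectorize(lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0))))
    z = (bin_edges[None, :] - np.asarray(values)[:, None]) / sigma
    cdf = phi(z)
    probs = cdf[:, 1:] - cdf[:, :-1]
    return probs / probs.sum(axis=1, keepdims=True)  # renormalize truncated tails
```

Note that `sigma` here is an absolute width, matching the paper's finding that the variance of the Gaussian, not its extent in bins, is what matters.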
The results demonstrate improved performance over alternative distributional approaches as well as standard regression, and achieve monotonic scaling in the cases where regression did not. An ablation shows that the cross-entropy loss itself is critical, compared to merely lifting the representation from a scalar to a distribution.
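The cross-entropy objective on the quantized targets, with the scalar value recovered as the expectation over bin centers, can be sketched like this (a NumPy illustration under assumed names; not the authors' implementation):

```python
import numpy as np

def log_softmax(logits):
    # numerically stable log-softmax over the last axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy_value_loss(logits, target_probs):
    """Cross-entropy between predicted bin logits and a target distribution
    over the same bins (e.g. Two-Hot or Gaussian-smeared)."""
    return -(target_probs * log_softmax(logits)).sum(axis=-1).mean()

def value_from_logits(logits, bin_centers):
    """Read the scalar value back out as the expectation over bin centers."""
    return np.exp(log_softmax(logits)) @ bin_centers
```

This is the sense in which the representation alone is not enough: the gradient comes from the cross-entropy between distributions, not from a squared error on the recovered scalar.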
Stop Regressing: Training Value Functions via Classification for Scalable Deep RL