Offline Actor-Critic Reinforcement Learning Scales to Large Models

Jost Tobias Springenberg, Abbas Abdolmaleki, Jingwei Zhang, Oliver Groth, Michael Bloesch, Thomas Lampe, Philemon Brakel, Sarah Bechtle, Steven Kapturowski, Roland Hafner, Nicolas Heess, Martin Riedmiller
Large-scale models for policy learning in control and robotics have shown impressive multi-task and generalisation capabilities in recent years, but policy learning in the generalist large-model regime has so far relied mostly on Behaviour Cloning (BC), which requires near-optimal demonstrations at training time. This work demonstrates the benefits of large-scale models for offline RL.
The key contribution is an offline actor-critic algorithm that smoothly trades off RL and BC loss terms, combined with a scalable transformer-based multi-modal architecture that represents both the policy and the value function. The experiments include a scaling analysis, comparisons for pre-training against strong BC baselines such as Gato (Reed et al., 2022) and RoboCat (Bousmalis et al., 2023), and an analysis of fine-tuning with the critic [1] [2]. A sketch of what such a combined BC/RL loss can look like is given after the references.
[1] Reed et al., 2022. A Generalist Agent.
[2] Bousmalis et al., 2023. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation.