Offline reinforcement learning aims to learn policies from a fixed dataset without online exploration; model-based approaches additionally learn a dynamics model to generate simulated data for policy learning.
A new approach, offline trajectory optimization (OTTO), is proposed, which conducts long-horizon simulations and uses model uncertainty to evaluate and correct the simulated data.
OTTO employs an ensemble of Transformers, called World Transformers, to predict environment dynamics and reward functions; long-horizon trajectories are simulated with the World Transformers, and low-confidence transitions are evaluated and corrected by an uncertainty-based World Evaluator.
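The core mechanism, using disagreement across an ensemble of Transformer world models as an uncertainty signal for simulated transitions, can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the class names, network sizes, and the disagreement threshold are assumptions made for the example.

```python
# Minimal sketch (not the paper's code) of ensemble world models with
# disagreement-based uncertainty; all names and hyperparameters are illustrative.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """One ensemble member: a small Transformer mapping a history of
    (state, action) pairs to a predicted next state and reward."""
    def __init__(self, state_dim, action_dim, d_model=64, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(state_dim + action_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, state_dim + 1)  # next state + reward

    def forward(self, states, actions):
        x = self.embed(torch.cat([states, actions], dim=-1))  # (B, T, d_model)
        h = self.encoder(x)[:, -1]                            # last time step
        out = self.head(h)
        return out[:, :-1], out[:, -1]                        # next_state, reward

def ensemble_step(models, states, actions, threshold=0.5):
    """Roll the ensemble one step and flag low-confidence predictions by the
    disagreement (standard deviation) across ensemble members."""
    preds = [m(states, actions) for m in models]
    next_states = torch.stack([s for s, _ in preds])  # (K, B, state_dim)
    rewards = torch.stack([r for _, r in preds])      # (K, B)
    mean_state = next_states.mean(0)
    uncertainty = next_states.std(0).norm(dim=-1) + rewards.std(0)
    low_confidence = uncertainty > threshold          # candidates for correction
    return mean_state, rewards.mean(0), low_confidence

# Usage: one simulated step with an ensemble of 4 world models on random data.
if __name__ == "__main__":
    state_dim, action_dim = 11, 3
    models = [WorldModel(state_dim, action_dim) for _ in range(4)]
    states = torch.randn(8, 5, state_dim)   # batch of 8 histories, length 5
    actions = torch.randn(8, 5, action_dim)
    s_next, r, flags = ensemble_step(models, states, actions)
    print(s_next.shape, r.shape, flags.sum().item(), "low-confidence samples")
```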
Experiments indicate that OTTO improves the performance of offline RL algorithms, even in complex, sparse-reward environments such as AntMaze.