A new method called Provably Lifetime Safe RL (PLS) has been proposed for safe reinforcement learning (RL).
PLS integrates offline safe RL with safe policy deployment, ensuring the safety of a policy throughout its lifetime, from learning to operation.
The method learns a policy offline using return-conditioned supervised learning and optimizes target returns using Gaussian processes (GPs) during deployment.
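The deployment-time idea of tuning a target return with a GP can be sketched as a small Bayesian-optimization loop. This is not the authors' algorithm: the `rollout_return` function, the kernel choice, and the UCB acquisition rule below are all hypothetical stand-ins used purely for illustration, with a synthetic stand-in for a return-conditioned policy interacting with an environment.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def rollout_return(target_return: float, rng: np.random.Generator) -> float:
    """Hypothetical stand-in: conditioning a policy on an over-ambitious
    target return yields diminishing (noisy) achieved returns."""
    achieved = min(target_return, 8.0) - 0.05 * target_return**2
    return achieved + rng.normal(scale=0.2)

def optimize_target_return(candidates, n_init=5, n_iter=15, seed=0):
    """Fit a GP mapping target return -> observed return, then pick the
    next target to try via an upper-confidence-bound acquisition rule."""
    rng = np.random.default_rng(seed)
    tried = list(rng.choice(candidates, size=n_init))
    observed = [rollout_return(g, rng) for g in tried]
    gp = GaussianProcessRegressor(
        kernel=RBF(length_scale=2.0) + WhiteKernel(noise_level=0.05),
        normalize_y=True,
    )
    for _ in range(n_iter):
        gp.fit(np.array(tried).reshape(-1, 1), np.array(observed))
        mu, sigma = gp.predict(np.array(candidates).reshape(-1, 1),
                               return_std=True)
        ucb = mu + 1.5 * sigma            # optimism drives exploration
        g_next = float(candidates[int(np.argmax(ucb))])
        tried.append(g_next)
        observed.append(rollout_return(g_next, rng))
    return tried[int(np.argmax(observed))]  # best target return found

best = optimize_target_return(np.linspace(0.0, 12.0, 25))
print(best)
```

In PLS the GP would additionally have to respect safety constraints when proposing target returns; the loop above only maximizes observed return, so it illustrates the surrogate-model mechanics rather than the safety guarantee.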
Empirical results show that PLS outperforms baselines on both safety and reward, achieving high rewards while keeping the policy safe.