Applying reinforcement learning to safety-critical applications requires safety guarantees during both policy training and deployment.
The paper introduces Safe Policy Ratio (SPoRt), which bounds the probability of violating a safety property in a model-free, episodic setting.
SPoRt includes Projected PPO, a new approach for training task-specific policies while maintaining a user-specified bound on property violation.
Experimental results demonstrate the trade-off in SPoRt between safety guarantees and task-specific performance.