Online safe reinforcement learning is crucial in dynamic environments such as autonomous driving, robotics, and cybersecurity.
Existing methods for constrained Markov decision processes (CMDPs) typically assume known, fixed constraints and struggle in adversarial settings where the constraints are unknown and time-varying.
The Optimistic Mirror Descent Primal-Dual (OMDPD) algorithm is introduced to handle online CMDPs with anytime adversarial constraints.
OMDPD achieves optimal O(√K) regret and O(√K) strong constraint violation, where K is the number of episodes and strong violation sums per-episode violations without allowing cancellation across episodes. It does so without requiring prior knowledge of a strictly safe policy, providing practical guarantees for safe decision-making in adversarial environments.
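To make the primal-dual structure concrete, the following is a minimal sketch of an optimistic mirror-descent primal-dual loop, not the paper's implementation. It collapses the CMDP to a single state so the primal variable is just a distribution over A actions (a stand-in for an occupancy measure), draws synthetic rewards and costs, and uses hypothetical step sizes (eta, alpha) and cost budget; all names and constants are assumptions for illustration.

```python
import numpy as np

def omdpd_sketch(K=1000, A=5, eta=0.5, alpha=0.2, budget=0.4, seed=0):
    """Toy optimistic mirror-descent primal-dual loop (illustrative only).

    The primal iterate is a distribution over A actions; the dual
    variable penalizes violation of a per-episode cost budget.
    """
    rng = np.random.default_rng(seed)
    z = np.full(A, 1.0 / A)      # secondary (mirror) iterate
    lam = 0.0                    # dual variable for the cost constraint
    prev_grad = np.zeros(A)      # optimistic prediction: last observed gradient
    cum_reward, cum_violation = 0.0, 0.0

    for _ in range(K):
        # Optimistic step: act using the predicted gradient
        # (entropic mirror map, i.e. exponentiated-gradient update).
        x = z * np.exp(-eta * prev_grad)
        x /= x.sum()

        # Adversary reveals this episode's reward and cost vectors.
        r = rng.uniform(size=A)
        c = rng.uniform(size=A)
        cum_reward += x @ r
        # "Strong" violation: positive parts only, no cancellation.
        cum_violation += max(0.0, x @ c - budget)

        # Gradient of the Lagrangian in x (negated: mirror descent minimizes).
        grad = -(r - lam * c)

        # Mirror step on the secondary iterate with the observed gradient.
        z = z * np.exp(-eta * grad)
        z /= z.sum()

        # Projected dual ascent on the observed constraint violation.
        lam = max(0.0, lam + alpha * (x @ c - budget))
        prev_grad = grad

    return cum_reward, cum_violation

if __name__ == "__main__":
    reward, violation = omdpd_sketch()
    print(f"cumulative reward: {reward:.1f}, strong violation: {violation:.1f}")
```

The two-sequence structure is the optimistic part: the played iterate x is formed from the predicted gradient, while the secondary iterate z is updated with the observed one, which tightens regret when consecutive losses change slowly.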