Offline reinforcement learning (RL) in healthcare is hampered by out-of-distribution (OOD) issues, which can lead to harmful treatment recommendations that fall outside the range of clinically validated practice.
Existing methods such as conservative Q-learning (CQL) address OOD problems only by constraining action selection, which can reduce the learned policy to imitating clinician actions that focus on short-term rewards.
A new model-based Offline Guarded Safe Reinforcement Learning (OGSRL) framework is proposed to improve treatment optimization by regulating both action selection and the resulting downstream state trajectories.
OGSRL introduces an OOD guardian that confines policy exploration to regions validated by the offline data, together with a safety cost constraint that keeps policies within medical safety boundaries, and it provides theoretical guarantees on safety and near-optimality.
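To make the two mechanisms concrete, the following minimal Python sketch illustrates the general idea of combining an OOD check with a safety-cost budget when executing a candidate action. All names (`ood_guardian`, `safety_cost`, `guarded_action`) and the toy distance-based guardian are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Illustrative sketch only: a candidate action is executed only if it lies in
# the region supported by the offline (clinician) dataset AND its estimated
# safety cost is within a clinical budget; otherwise the nearest clinician
# action from the dataset is used as a fallback.

rng = np.random.default_rng(0)

# Toy offline dataset of (state, action) pairs from clinician trajectories.
dataset_states = rng.normal(size=(500, 4))
dataset_actions = rng.normal(size=(500, 2))


def ood_guardian(state, action, radius=1.5):
    """Flag (state, action) as in-distribution if it is close to some
    observed clinician (state, action) pair (a simple stand-in for a
    learned support/uncertainty estimator)."""
    joint = np.hstack([dataset_states, dataset_actions])
    query = np.hstack([state, action])
    return np.linalg.norm(joint - query, axis=1).min() <= radius


def safety_cost(state, action):
    """Stand-in for a learned safety-cost model (e.g., predicted risk of
    the downstream state trajectory); here just a toy proxy."""
    return float(np.linalg.norm(action))


def guarded_action(candidate_action, state, cost_budget=2.0):
    """Accept the policy's candidate action only if it passes both the OOD
    guardian and the safety-cost constraint; otherwise imitate the nearest
    clinician action from the dataset."""
    in_support = ood_guardian(state, candidate_action)
    safe = safety_cost(state, candidate_action) <= cost_budget
    if in_support and safe:
        return candidate_action
    nearest = np.argmin(np.linalg.norm(dataset_states - state, axis=1))
    return dataset_actions[nearest]


state = rng.normal(size=4)
proposed = rng.normal(size=2)
print(guarded_action(proposed, state))
```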