Stationary Distribution Correction Estimation (DICE) addresses the mismatch between the stationary distribution induced by a policy and the target distribution required for reliable off-policy evaluation and policy optimization.
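As a brief illustration in generic DICE notation (the symbols below are standard but not necessarily the paper's), the correction ratio reweights samples drawn from the offline data distribution so that expectations under the target policy's stationary distribution can be estimated:

```latex
w_{\pi/D}(s,a) = \frac{d^{\pi}(s,a)}{d^{D}(s,a)},
\qquad
\hat{J}(\pi) = \mathbb{E}_{(s,a)\sim d^{D}}\!\left[\, w_{\pi/D}(s,a)\, r(s,a) \,\right],
```

where d^pi is the stationary state-action distribution induced by the target policy, d^D is the distribution of the offline dataset, and r is the reward; replacing r with a cost function gives the cost estimates needed in constrained reinforcement learning.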
However, recent approaches that enhance offline reinforcement learning performance inadvertently undermine DICE's ability to perform off-policy evaluation, which is especially problematic in constrained reinforcement learning scenarios.
This limitation is attributed to their reliance on semi-gradient optimization, which leads to failures in cost estimation within the DICE framework.
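To make the distinction concrete, the sketch below (assumed variable names and a toy value network, not the paper's code) shows how a semi-gradient update differs from a full-gradient one on a DICE-style Bellman residual: the bootstrapped term is detached, so the parameters are no longer updated along the true gradient of the evaluation objective.

```python
import torch

# Minimal sketch (illustrative, not the paper's implementation) contrasting
# full-gradient and semi-gradient optimization of a Bellman-like residual
#   e(s, s') = r + gamma * nu(s') - nu(s).
# Under semi-gradient, nu(s') is detached, so the update does not follow the
# true gradient of the squared-residual objective.

def residual(nu, s, s_next, r, gamma, semi_gradient: bool):
    next_val = nu(s_next)
    if semi_gradient:
        next_val = next_val.detach()  # stop-gradient through the bootstrapped target
    return r + gamma * next_val - nu(s)

# Hypothetical value network over a toy 4-dimensional state space.
nu = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(nu.parameters(), lr=3e-4)

s, s_next = torch.randn(32, 4), torch.randn(32, 4)
r, gamma = torch.randn(32, 1), 0.99

loss = residual(nu, s, s_next, r, gamma, semi_gradient=True).pow(2).mean()
opt.zero_grad()
loss.backward()
opt.step()
```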
A novel method, semi-gradient DICE, is proposed to overcome these limitations; it enables accurate off-policy evaluation and improves performance in offline constrained reinforcement learning, achieving state-of-the-art results on the DSRL benchmark.