Actor-critic methods are commonly used in online reinforcement learning for continuous action spaces.
Unlike algorithms for discrete action spaces, RL algorithms for continuous actions typically learn Q-values with the Bellman operator rather than the Bellman optimality operator, since maximizing over a continuous action space is generally intractable.
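For reference, in standard MDP notation (reward $r$, discount $\gamma$, transition kernel $p$, policy $\pi$; these symbols are not defined in the original text), the two backups can be written as

\[
(\mathcal{T}^{\pi} Q)(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s' \sim p(\cdot\mid s,a),\; a' \sim \pi(\cdot\mid s')}\big[Q(s',a')\big],
\qquad
(\mathcal{T}^{*} Q)(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s' \sim p(\cdot\mid s,a)}\Big[\max_{a'} Q(s',a')\Big].
\]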
Incorporating the Bellman optimality operator into actor-critic frameworks accelerates learning but may introduce overestimation bias.
The proposed annealing approach gradually transitions from the Bellman optimality operator to the Bellman operator, gaining the former's learning efficiency early in training while mitigating its overestimation bias later; it outperforms existing approaches on locomotion and manipulation tasks.
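As a rough illustration only, a critic target that anneals between the two backups might look like the sketch below. The linear schedule, the function and argument names, and the use of sampled candidate actions to approximate the max over a continuous action space are assumptions for illustration, not the paper's actual algorithm.

```python
import numpy as np

def annealed_target(reward, done, q_next_policy, q_next_candidates,
                    step, total_steps, gamma=0.99):
    """Hypothetical annealed critic target (illustrative sketch only).

    q_next_policy:      Q(s', a') for the action drawn from the current policy
                        (Bellman-operator-style backup).
    q_next_candidates:  Q(s', a_i) for sampled candidate actions, used to
                        approximate max_a' Q(s', a')
                        (Bellman-optimality-style backup).
    """
    # Annealing weight: starts at 1 (optimality operator) and decays to 0
    # (standard Bellman operator) over the course of training.
    w = max(0.0, 1.0 - step / total_steps)

    # Approximate the max over continuous actions with the best sampled candidate.
    q_max = np.max(q_next_candidates, axis=-1)

    # Interpolate between the two backups, then form the TD target.
    q_backup = w * q_max + (1.0 - w) * q_next_policy
    return reward + gamma * (1.0 - done) * q_backup


# Example usage with dummy batch data (2 transitions, 4 candidate actions each).
rng = np.random.default_rng(0)
reward = np.array([1.0, 0.5])
done = np.array([0.0, 1.0])
q_next_policy = rng.normal(size=2)
q_next_candidates = rng.normal(size=(2, 4))
print(annealed_target(reward, done, q_next_policy, q_next_candidates,
                      step=10_000, total_steps=1_000_000))
```

Early in training `w` is near 1, so the target leans on the greedy (optimality-style) backup; as `w` decays, the target reduces to the ordinary policy-evaluation backup, which matches the transition described above.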