Researchers propose efficient learning algorithms for entropy-regularized Markov Decision Processes (MDPs) with large or continuous state and action spaces.
The algorithms integrate fixed-point iteration with multilevel Monte Carlo techniques and a stochastic approximation of the Bellman operator.
Using a biased plain Monte Carlo estimate of the Bellman operator yields quasi-polynomial sample complexity, whereas an unbiased randomized multilevel approximation achieves polynomial sample complexity in expectation.
The performance guarantees of the proposed algorithms do not depend on the dimension or the cardinality of the state and action spaces.
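The contrast between the biased plain Monte Carlo estimate and the unbiased randomized multilevel estimate can be illustrated on a toy problem. The sketch below assumes that the difficulty comes from a nonlinear function (here a logarithm) applied to an expectation, as in the usual soft-max form of an entropy-regularized Bellman operator. It uses one standard randomized multilevel construction, in the style of Blanchet and Glynn, which is not necessarily the one used by the authors. The function names, the level parameter `r`, and the requirement that samples be strictly positive are all assumptions made for illustration.

```python
import math
import random


def plain_mc_log_mean(sample, n, rng):
    """Plug-in estimate of log(E[X]) from n i.i.d. draws.

    Biased for finite n, because log is concave (Jensen's inequality).
    """
    return math.log(sum(sample(rng) for _ in range(n)) / n)


def unbiased_mlmc_log_mean(sample, rng, r=2 ** -1.5):
    """One randomized multilevel estimate of log(E[X]).

    A random level N is drawn with P(N = n) = (1 - r) * r**n. Choosing r
    strictly between 1/4 and 1/2 is the usual choice aimed at finite variance
    together with finite expected cost (an assumption of this sketch).
    `sample(rng)` must return strictly positive values.
    """
    # Draw the random level N from the geometric distribution above.
    level = 0
    while rng.random() < r:
        level += 1
    p_level = (1.0 - r) * r ** level

    # Draw 2^(level+1) samples, kept as two halves of 2^level each.
    half = 2 ** level
    first = [sample(rng) for _ in range(half)]
    second = [sample(rng) for _ in range(half)]

    # Fine estimate uses all samples; coarse estimate averages the two halves.
    fine = math.log((sum(first) + sum(second)) / (2 * half))
    coarse = 0.5 * (math.log(sum(first) / half) + math.log(sum(second) / half))

    # Base term from one independent sample, plus the reweighted correction.
    # The level differences telescope in expectation, which removes the bias.
    base = math.log(sample(rng))
    return base + (fine - coarse) / p_level


if __name__ == "__main__":
    rng = random.Random(0)
    draw = lambda g: g.uniform(1.0, 2.0)  # toy distribution with E[X] = 1.5
    runs = 20000
    avg = sum(unbiased_mlmc_log_mean(draw, rng) for _ in range(runs)) / runs
    print("target           :", math.log(1.5))
    print("plain MC, n = 2  :", plain_mc_log_mean(draw, 2, rng))
    print("multilevel, mean :", avg)
```

A single multilevel estimate is noisy, so it is the average over many independent draws that should approach the target. The random level keeps the expected number of samples per draw small, at the price of occasional expensive draws.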