Domain randomization (DR) enables sim-to-real transfer by training controllers on a distribution of simulated environments.
Simple policy gradient (PG) methods are often used to solve DR problems, but their theoretical guarantees are limited.
This study provides a convergence analysis of PG methods for domain-randomized linear quadratic regulation (LQR).
It shows that PG converges globally under suitable bounds on the heterogeneity of the sampled systems and proposes a discount-factor annealing algorithm.
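
To make the setting concrete, the following is a minimal sketch of gradient descent on a domain-randomized LQR cost with a discount factor that is gradually raised to 1. It is not the algorithm analyzed in the study: it assumes the sampled models (A_i, B_i) are known and uses exact model-based gradients in place of sampled PG estimates. The discount update rule (the 0.9 safety margin), the backtracking line search, and all function names are illustrative assumptions, as are the three scalar systems in the example.

```python
import numpy as np

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def dr_cost_and_grad(K, systems, Q, R, gamma):
    """Average discounted LQR cost of u = -K x over all systems, and its gradient.

    Assumes x0 has identity covariance, so each cost is trace(P_i).
    Returns (inf, None) if K does not stabilize some sqrt(gamma)-scaled system.
    """
    n = Q.shape[0]
    I = np.eye(n * n)
    cost, grad = 0.0, np.zeros_like(K)
    for A, B in systems:
        F = A - B @ K
        if gamma * spectral_radius(F) ** 2 >= 1.0:
            return np.inf, None
        # P = Q + K'RK + gamma F'PF and S = I + gamma F S F', solved by vectorization.
        P = np.linalg.solve(I - gamma * np.kron(F.T, F.T),
                            (Q + K.T @ R @ K).ravel()).reshape(n, n)
        S = np.linalg.solve(I - gamma * np.kron(F, F),
                            np.eye(n).ravel()).reshape(n, n)
        cost += np.trace(P)
        grad += 2.0 * ((R + gamma * B.T @ P @ B) @ K - gamma * B.T @ P @ A) @ S
    return cost / len(systems), grad / len(systems)

def descend(K, systems, Q, R, gamma, steps=200):
    """Gradient descent with Armijo backtracking; keeps the cost finite."""
    for _ in range(steps):
        J, g = dr_cost_and_grad(K, systems, Q, R, gamma)
        g2 = np.sum(g * g)
        if g2 < 1e-16:
            break
        s = 1.0
        while s > 1e-12:
            J_new, _ = dr_cost_and_grad(K - s * g, systems, Q, R, gamma)
            if J_new <= J - 1e-4 * s * g2:
                K = K - s * g
                break
            s *= 0.5
        else:
            break  # no acceptable step found
    return K

def anneal(systems, Q, R, max_rounds=50):
    """Heuristic discount annealing (an assumption, not the paper's schedule)."""
    m, n = systems[0][1].shape[1], Q.shape[0]
    K, gamma = np.zeros((m, n)), 0.0
    for _ in range(max_rounds):
        rho = max(spectral_radius(A - B @ K) for A, B in systems)
        # Largest discount (capped at 1) with gamma * rho^2 <= 0.9 < 1.
        gamma = max(gamma, min(1.0, 0.9 / max(rho, 1e-12) ** 2))
        K = descend(K, systems, Q, R, gamma)
        if gamma == 1.0:
            break
    return K, gamma

# Illustrative domain: three open-loop unstable scalar systems.
systems = [(np.array([[a]]), np.array([[1.0]])) for a in (1.1, 1.2, 1.3)]
Q, R = np.eye(1), np.eye(1)
K, gamma = anneal(systems, Q, R)
print("gain:", K.ravel(), "final discount:", gamma)
print("closed-loop radii:", [float(spectral_radius(A - B @ K)) for A, B in systems])
```

Starting from K = 0 with a small discount keeps the initial cost finite even though every sampled system is open-loop unstable; the discount is raised only once the current gain leaves enough stability margin. The sketch caps the number of annealing rounds because this heuristic schedule carries no guarantee of reaching a discount of 1 in general.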