In the zero-shot policy transfer setting in reinforcement learning, an agent is trained on a fixed set of environments and is then expected to generalize to unseen environments without further training.
Policy distillation after training can improve performance in the test environments, and the theory suggests two ingredients for this: distilling an ensemble of policies rather than a single one, and using diverse training data for the distillation.
This paper proves a generalization bound for policy distillation after training, which offers guidance on how to improve generalization in reinforcement learning. Empirically, an ensemble of policies distilled on a diverse dataset generalizes significantly better than the original agent.
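The following is a minimal sketch, not the paper's implementation, of the general recipe described above: a trained teacher policy is distilled into an ensemble of student policies on a diverse state dataset, and at test time the students' action distributions are averaged. The network sizes, dataset, and hyperparameters are illustrative assumptions.

```python
# Hedged sketch: distill a teacher policy into an ensemble of students,
# then act with the averaged ensemble distribution. All names and sizes
# below are assumptions for illustration, not values from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS, ENSEMBLE_SIZE = 8, 4, 5

def mlp_policy():
    # Small categorical policy head producing action logits.
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))

# Stand-in for the agent trained on the fixed set of training environments.
teacher = mlp_policy()

# "Diverse" distillation data: states that would normally be gathered from
# rollouts across many training environments; random placeholders here.
states = torch.randn(4096, STATE_DIM)
with torch.no_grad():
    teacher_logits = teacher(states)

# Distill each ensemble member independently by matching the teacher's
# action distribution (KL divergence) on minibatches of the diverse dataset.
students = [mlp_policy() for _ in range(ENSEMBLE_SIZE)]
for student in students:
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(200):
        idx = torch.randint(0, states.size(0), (256,))
        loss = F.kl_div(F.log_softmax(student(states[idx]), dim=-1),
                        F.softmax(teacher_logits[idx], dim=-1),
                        reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()

def ensemble_act(state):
    # At test time, average the students' action probabilities and act greedily.
    with torch.no_grad():
        probs = torch.stack([F.softmax(s(state), dim=-1) for s in students]).mean(0)
    return probs.argmax(dim=-1)

print(ensemble_act(torch.randn(1, STATE_DIM)))
```

Independent initializations and minibatch orderings give the ensemble members slightly different solutions, which is one plausible way to realize the diversity that the theory calls for.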