Researchers have introduced Dirichlet Process Posterior Sampling (DPPS), a Bayesian non-parametric algorithm for multi-armed bandits.
Like Thompson sampling, DPPS selects arms according to the posterior probability that each arm is optimal, but it does so without assuming a parametric family for the reward distribution.
The algorithm places Dirichlet Process priors directly on the reward-generating distribution of each arm, offering a principled way to incorporate prior beliefs.
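To make the sampling step concrete, below is a minimal sketch of one round of DP posterior sampling for rewards in [0, 1]. It assumes a Uniform(0, 1) base measure G0 and approximates a draw from each arm's DP posterior by Dirichlet-weighting the observed rewards (weight 1 each) together with a single pseudo-draw from G0 (weight alpha); as alpha tends to 0 this reduces to the Bayesian bootstrap. The function name dpps_choose_arm and this one-atom approximation are illustrative, not necessarily the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def dpps_choose_arm(rewards_per_arm, alpha=1.0, prior_sampler=lambda: rng.uniform(0.0, 1.0)):
    """One DPPS round: sample a mean from each arm's DP posterior, pull the argmax.

    Hypothetical sketch: a DP(alpha, G0) posterior draw's mean is approximated by
    Dirichlet(1, ..., 1, alpha) weights over the observed rewards plus one G0 draw.
    """
    sampled_means = []
    for rewards in rewards_per_arm:
        atoms = np.append(rewards, prior_sampler())     # observed rewards + one pseudo-draw from G0
        conc = np.append(np.ones(len(rewards)), alpha)  # concentration: 1 per observation, alpha for the prior atom
        weights = rng.dirichlet(conc)                   # one sample of the random measure's weights
        sampled_means.append(weights @ atoms)           # mean of the sampled distribution
    return int(np.argmax(sampled_means))                # greedy with respect to the sampled means

# Toy usage: three Bernoulli arms, 1000 rounds
true_means = [0.3, 0.5, 0.7]
history = [[] for _ in true_means]
for t in range(1000):
    a = dpps_choose_arm([np.asarray(h) for h in history])
    history[a].append(float(rng.random() < true_means[a]))
```

As in Thompson sampling, the randomness of the posterior draw supplies the exploration: arms with little data have high-variance sampled means and therefore continue to be tried occasionally.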
Empirical studies demonstrate strong performance of DPPS across a variety of bandit environments, and non-asymptotic optimality is established through an information-theoretic analysis of its Bayesian regret.
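The information-theoretic route is presumably in the spirit of Russo and Van Roy's information-ratio analysis of Thompson sampling, a standard way to obtain non-asymptotic Bayesian regret bounds; as a hedged sketch of that general framework (the DPPS-specific ratio bound and constants are not reproduced here):

```latex
% Per-round information ratio (Delta_t: instantaneous regret; A*: optimal arm;
% (A_t, Y_t): arm pulled and reward observed at round t), and the generic
% regret bound that a uniform bound on the ratio yields.
\[
  \Gamma_t \;=\; \frac{\bigl(\mathbb{E}_t[\Delta_t]\bigr)^2}{I_t\bigl(A^\star;\,(A_t, Y_t)\bigr)},
  \qquad
  \Gamma_t \le \bar{\Gamma}\ \text{for all } t
  \;\Longrightarrow\;
  \mathbb{E}\bigl[\mathrm{Regret}(T)\bigr] \;\le\; \sqrt{\bar{\Gamma}\, H(A^\star)\, T},
\]
```

where H(A*) denotes the entropy of the prior distribution of the optimal arm.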