The dueling bandit problem, in which an agent learns from pairwise comparisons rather than numeric rewards, is gaining popularity because of its applications in areas such as online advertising and recommendation systems.
Delayed feedback poses a challenge for the existing dueling bandit literature, because it impairs the agent's ability to update its policy quickly and accurately.
A new problem, the biased dueling bandit problem with stochastic delayed feedback, is introduced; it involves a preference bias between the selections.
Two algorithms are presented to handle delayed feedback: one requires complete information about the delay distribution, while the other requires only the expected value of the delay.
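
To make the delayed-feedback setting concrete, the following is a minimal Python sketch of a toy environment in which the outcome of each duel reaches the agent only after a random number of rounds. It is not either of the two algorithms summarized above, and it does not model the preference bias. The class `DelayedDuelEnv`, the function `run`, the preference matrix values, the exponential delay model, and the uniform arm-selection rule are all assumptions chosen purely for illustration.

```python
import heapq
import random


class DelayedDuelEnv:
    """Toy dueling-bandit environment with stochastically delayed feedback.

    Hypothetical illustration only: the delay model (rounded-down exponential)
    and the preference matrix are assumptions, not taken from the source.
    """

    def __init__(self, pref, mean_delay, seed=0):
        self.pref = pref              # pref[i][j] = probability that arm i beats arm j
        self.mean_delay = mean_delay  # scale of the assumed delay distribution, in rounds
        self.rng = random.Random(seed)
        self.pending = []             # min-heap of (arrival_round, i, j, i_won)
        self.t = 0                    # current round

    def duel(self, i, j):
        """Play arms i and j; the outcome is queued instead of being returned."""
        i_won = self.rng.random() < self.pref[i][j]
        if self.mean_delay > 0:
            delay = int(self.rng.expovariate(1.0 / self.mean_delay))
        else:
            delay = 0
        heapq.heappush(self.pending, (self.t + delay, i, j, i_won))

    def step(self):
        """Return every outcome whose delay has elapsed, then advance one round."""
        arrived = []
        while self.pending and self.pending[0][0] <= self.t:
            _, i, j, i_won = heapq.heappop(self.pending)
            arrived.append((i, j, i_won))
        self.t += 1
        return arrived


def run(env, n_arms, horizon):
    """Duel uniformly random pairs and count only the feedback that has arrived."""
    wins = [[0] * n_arms for _ in range(n_arms)]
    for _ in range(horizon):
        i, j = env.rng.sample(range(n_arms), 2)
        env.duel(i, j)
        for a, b, a_won in env.step():
            if a_won:
                wins[a][b] += 1
            else:
                wins[b][a] += 1
    return wins


if __name__ == "__main__":
    pref = [[0.5, 0.7], [0.3, 0.5]]  # hypothetical preference probabilities
    env = DelayedDuelEnv(pref, mean_delay=5, seed=1)
    wins = run(env, n_arms=2, horizon=200)
    observed = sum(map(sum, wins))
    print(f"observed {observed} outcomes, {len(env.pending)} still delayed")
```

At any round the win counts reflect only the outcomes that have already arrived, so the agent's estimates lag behind its actions. A learning algorithm for this setting has to account for that gap, for example by using knowledge of the delay distribution or of its expected value.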