In this paper, the authors propose an efficient algorithm for the online shortest path problem in directed acyclic graphs (DAGs) under bandit feedback against an adaptive adversary.
The algorithm achieves a near-minimax optimal regret bound of O(√|E|Tlog|X|) with high probability against any adaptive adversary.
The algorithm utilizes a novel loss estimator and a centroid-based decomposition to attain this regret bound.
The algorithm's application extends to various domains, including extensive-form games, shortest walks in directed graphs, hypercubes, and multi-task multi-armed bandits, providing improved regret guarantees in each of these settings.