Researchers have studied online learning in episodic finite-horizon Markov decision processes with convex objective functions, a setting referred to as the concave utility reinforcement learning (CURL) problem.
This setting extends standard RL from linear to convex losses over the state-action distribution induced by the agent's policy; because the objective is no longer linear in this distribution, classical Bellman-based techniques do not apply directly, and new algorithmic approaches are required.
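For concreteness, the CURL objective can be written as follows (the notation here is a generic formulation for illustration, not necessarily the exact one used in the work being summarized):
$$ \min_{\pi} \; F\big(\mu^{\pi}\big), \qquad \mu^{\pi}_h(s,a) = \Pr\big(s_h = s,\ a_h = a \mid \pi\big), \quad h = 1, \dots, H, $$
where $F$ is convex in the occupancy measure $\mu^{\pi}$ and $H$ is the horizon. Standard RL is recovered in the linear case $F(\mu) = \sum_{h} \langle \ell_h, \mu_h \rangle$ for fixed loss vectors $\ell_h$.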
The first algorithm achieving near-optimal regret bounds for online CURL without prior knowledge of the transition function has been introduced; it combines online mirror descent with an exploration bonus.
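A minimal sketch of such a bonus-augmented mirror descent update on occupancy measures, assuming a generic Bregman divergence $D$ and step size $\eta$ (the specific bonus and divergence are design choices of the algorithm and are not reproduced here), is:
$$ \mu_{t+1} \in \arg\min_{\mu \in \Delta(\widehat{P}_t)} \ \eta \, \big\langle \nabla F(\mu_t) - b_t,\ \mu \big\rangle + D(\mu, \mu_t), $$
where $\Delta(\widehat{P}_t)$ denotes the set of occupancy measures consistent with the current estimate of the transition function and $b_t$ is an exploration bonus compensating for the uncertainty in that estimate.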
Additionally, the bandit version of CURL has been addressed for the first time, achieving a sub-linear regret bound by adapting techniques from bandit convex optimization to the MDP setting.
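As a rough illustration of the kind of tool bandit convex optimization provides (a generic one-point gradient estimator, not necessarily the construction used in this work), the gradient of a smoothed version of $F$ can be estimated from a single evaluation:
$$ \hat{g}_t = \frac{d}{\delta}\, F\big(\mu_t + \delta u_t\big)\, u_t, \qquad u_t \sim \mathrm{Unif}\big(\mathbb{S}^{d-1}\big), $$
so that mirror-descent-style updates can still be run when only scalar (bandit) feedback on the incurred loss is available.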