<ul data-eligibleForWebStory="true">
  <li>Study of reinforcement learning from human feedback (RLHF) in general Markov decision processes, focusing on trajectory-level preference comparisons.</li>
  <li>Challenge: designing algorithms that issue informative preference queries to identify the reward function, with theoretical guarantees.</li>
  <li>Proposes a meta-algorithm based on randomized exploration that addresses these challenges while avoiding heavy computational machinery (see the sketch after this list).</li>
  <li>Establishes regret and last-iterate guarantees under mild reinforcement-learning oracle assumptions.</li>
  <li>Introduces an improved algorithm that collects batches of trajectory pairs and selects informative queries via optimal experimental design (a design sketch follows below).</li>
  <li>The batch structure allows preference queries to be parallelized, improving efficiency in practical deployments.</li>
  <li>Empirical evaluation shows the approach is competitive with reward-based reinforcement learning while using only a small number of preference queries.</li>
</ul>
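As a rough illustration of the randomized-exploration idea, the minimal Python sketch below assumes a toy setting with a linear reward over fixed trajectory features and Bradley-Terry preference feedback; the candidate trajectories, posterior, step size, and update rule are hypothetical stand-ins for illustration, not the paper's construction. Two independent posterior samples each pick a favorite trajectory, the resulting pair is submitted as a preference query, and a logistic reward estimate is updated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: each candidate trajectory is summarized by a fixed
# feature vector phi(tau), and the true reward is linear in those features.
d, n_traj = 5, 40
features = rng.normal(size=(n_traj, d))   # phi(tau) for each candidate trajectory
theta_star = rng.normal(size=d)           # unknown true reward parameter

def preference(i, j):
    """Bradley-Terry feedback: P(tau_i preferred over tau_j)."""
    p = 1.0 / (1.0 + np.exp(-(features[i] - features[j]) @ theta_star))
    return rng.random() < p

# Randomized-exploration loop (Thompson-sampling flavor): sample two reward
# vectors from a Gaussian posterior, let each pick its best trajectory,
# query a preference on that pair, then update the reward estimate.
lam, eta = 1.0, 0.5
A = lam * np.eye(d)                       # regularized design matrix
theta_hat = np.zeros(d)

for t in range(200):
    cov = np.linalg.inv(A)
    # Two independent posterior samples induce the trajectory pair to compare.
    w1 = rng.multivariate_normal(theta_hat, cov)
    w2 = rng.multivariate_normal(theta_hat, cov)
    i, j = int(np.argmax(features @ w1)), int(np.argmax(features @ w2))
    if i == j:
        continue                          # identical picks carry no preference signal
    x = features[i] - features[j]
    y = 1.0 if preference(i, j) else 0.0
    # One online logistic-regression step on the Bradley-Terry likelihood:
    # the gradient of the log-likelihood in theta is (y - p) * x.
    p = 1.0 / (1.0 + np.exp(-x @ theta_hat))
    theta_hat += eta * (y - p) * x
    A += np.outer(x, x)

print("estimated-best trajectory:", int(np.argmax(features @ theta_hat)),
      "| true-best:", int(np.argmax(features @ theta_star)))
```

Note that the only optimization used is an argmax over candidate trajectories, standing in for the reinforcement-learning planning oracle the guarantees are stated against.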
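The batched variant's query selection can likewise be illustrated with a greedy D-optimal experimental design over the feature differences of candidate trajectory pairs; again, the linear features and the greedy selection rule are assumptions for illustration rather than the paper's exact procedure.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
d, n_traj, batch = 5, 20, 8
features = rng.normal(size=(n_traj, d))

# Candidate queries are trajectory pairs; each is summarized by the feature
# difference phi(tau_i) - phi(tau_j) that a preference on the pair would probe.
pairs = list(combinations(range(n_traj), 2))
diffs = np.stack([features[i] - features[j] for i, j in pairs])

# Greedy D-optimal design: repeatedly add the pair whose feature difference
# most increases log det(A). Since log det(A + x x^T) - log det(A)
# = log(1 + x^T A^{-1} x), maximizing the quadratic form suffices.
A = 1e-3 * np.eye(d)
chosen, available = [], list(range(len(pairs)))
for _ in range(batch):
    A_inv = np.linalg.inv(A)
    gains = np.einsum('nd,df,nf->n', diffs, A_inv, diffs)  # x^T A^{-1} x per pair
    k = max(available, key=lambda n: gains[n])
    available.remove(k)
    chosen.append(pairs[k])
    A += np.outer(diffs[k], diffs[k])

print("batch of trajectory pairs to query in parallel:", chosen)
```

Because the whole batch is fixed before any feedback arrives, the selected queries can be sent to human raters simultaneously, which is the parallelization benefit noted in the list above.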