<ul data-eligibleForWebStory="true">
  <li>Study of reinforcement learning from human feedback (RLHF) in general Markov decision processes, focusing on trajectory-level preference comparisons.</li>
  <li>Challenge: designing algorithms that issue informative preference queries to identify the reward function, with theoretical guarantees.</li>
  <li>Proposes a meta-algorithm based on randomized exploration that addresses these challenges while avoiding heavy computational machinery (see the sketch after this list).</li>
  <li>Establishes regret and last-iterate guarantees under mild reinforcement-learning oracle assumptions.</li>
  <li>Introduces an improved algorithm that collects batches of trajectory pairs and selects informative queries via optimal experimental design (a design sketch follows below).</li>
  <li>The batch structure allows preference queries to be parallelized, improving efficiency in practical deployments.</li>
  <li>Empirical evaluation shows the approach is competitive with reward-based reinforcement learning while using only a small number of preference queries.</li>
</ul>
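As a rough illustration of the randomized-exploration idea, the minimal Python sketch below assumes a toy setting with a linear reward over fixed trajectory features and Bradley-Terry preference feedback; the candidate trajectories, posterior, step size, and update rule are hypothetical stand-ins for illustration, not the paper's construction. Two independent posterior samples each pick a favorite trajectory, the resulting pair is submitted as a preference query, and a logistic reward estimate is updated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: each candidate trajectory is summarized by a fixed
# feature vector phi(tau), and the true reward is linear in those features.
d, n_traj = 5, 40
features = rng.normal(size=(n_traj, d))   # phi(tau) for each candidate trajectory
theta_star = rng.normal(size=d)           # unknown true reward parameter

def preference(i, j):
    """Bradley-Terry feedback: P(tau_i preferred over tau_j)."""
    p = 1.0 / (1.0 + np.exp(-(features[i] - features[j]) @ theta_star))
    return rng.random() < p

# Randomized-exploration loop (Thompson-sampling flavor): sample two reward
# vectors from a Gaussian posterior, let each pick its best trajectory,
# query a preference on that pair, then update the reward estimate.
lam, eta = 1.0, 0.5
A = lam * np.eye(d)                       # regularized design matrix
theta_hat = np.zeros(d)

for t in range(200):
    cov = np.linalg.inv(A)
    # Two independent posterior samples induce the trajectory pair to compare.
    w1 = rng.multivariate_normal(theta_hat, cov)
    w2 = rng.multivariate_normal(theta_hat, cov)
    i, j = int(np.argmax(features @ w1)), int(np.argmax(features @ w2))
    if i == j:
        continue                          # identical picks carry no preference signal
    x = features[i] - features[j]
    y = 1.0 if preference(i, j) else 0.0
    # One online logistic-regression step on the Bradley-Terry likelihood:
    # the gradient of the log-likelihood in theta is (y - p) * x.
    p = 1.0 / (1.0 + np.exp(-x @ theta_hat))
    theta_hat += eta * (y - p) * x
    A += np.outer(x, x)

print("estimated-best trajectory:", int(np.argmax(features @ theta_hat)),
      "| true-best:", int(np.argmax(features @ theta_star)))
```

Note that the only optimization used is an argmax over candidate trajectories, standing in for the reinforcement-learning planning oracle the guarantees are stated against.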
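The batched variant's query selection can likewise be illustrated with a greedy D-optimal experimental design over the feature differences of candidate trajectory pairs; again, the linear features and the greedy selection rule are assumptions for illustration rather than the paper's exact procedure.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
d, n_traj, batch = 5, 20, 8
features = rng.normal(size=(n_traj, d))

# Candidate queries are trajectory pairs; each is summarized by the feature
# difference phi(tau_i) - phi(tau_j) that a preference on the pair would probe.
pairs = list(combinations(range(n_traj), 2))
diffs = np.stack([features[i] - features[j] for i, j in pairs])

# Greedy D-optimal design: repeatedly add the pair whose feature difference
# most increases log det(A). Since log det(A + x x^T) - log det(A)
# = log(1 + x^T A^{-1} x), maximizing the quadratic form suffices.
A = 1e-3 * np.eye(d)
chosen, available = [], list(range(len(pairs)))
for _ in range(batch):
    A_inv = np.linalg.inv(A)
    gains = np.einsum('nd,df,nf->n', diffs, A_inv, diffs)  # x^T A^{-1} x per pair
    k = max(available, key=lambda n: gains[n])
    available.remove(k)
    chosen.append(pairs[k])
    A += np.outer(diffs[k], diffs[k])

print("batch of trajectory pairs to query in parallel:", chosen)
```

Because the whole batch is fixed before any feedback arrives, the selected queries can be sent to human raters simultaneously, which is the parallelization benefit noted in the list above.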