Reinforcement learning from human feedback (RLHF) lets machine learning systems learn objectives from human judgments rather than from a hand-specified reward function.
The Hidden Utility Bandit (HUB) framework formalizes the problem of learning from multiple teachers by modeling differences in their rationality, expertise, and cost.
The Active Teacher Selection (ATS) algorithm outperforms baseline algorithms by actively selecting when and which teacher to query.
The HUB framework and ATS algorithm facilitate future research on active teacher selection for robust reward modeling.
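
To make the idea of teachers that differ in rationality and cost concrete, here is a minimal sketch. It assumes a Boltzmann-rational (logistic) preference model, which is a common choice in the RLHF literature but is not stated in the summary above. The names `Teacher`, `beta`, `cost`, `preference_prob`, and `query` are hypothetical and chosen only for illustration. Expertise is not modeled here, and this is not the ATS algorithm itself.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Teacher:
    """Hypothetical teacher model (assumption, not taken from the source)."""
    beta: float  # rationality: larger values favor the higher-utility item more reliably
    cost: float  # price charged for a single query


def preference_prob(teacher: Teacher, u_a: float, u_b: float) -> float:
    """Probability that the teacher prefers item a over item b.

    Assumed Boltzmann-rational model: a logistic function of the
    utility gap, scaled by the teacher's rationality parameter.
    """
    return 1.0 / (1.0 + math.exp(-teacher.beta * (u_a - u_b)))


def query(teacher: Teacher, u_a: float, u_b: float, rng: random.Random) -> bool:
    """Simulate one noisy comparison; True means the teacher prefers item a."""
    return rng.random() < preference_prob(teacher, u_a, u_b)


if __name__ == "__main__":
    rng = random.Random(0)
    noisy_cheap = Teacher(beta=0.5, cost=1.0)
    reliable_costly = Teacher(beta=5.0, cost=10.0)
    for t in (noisy_cheap, reliable_costly):
        answers = [query(t, u_a=1.0, u_b=0.0, rng=rng) for _ in range(1000)]
        print(t, sum(answers) / len(answers))
```

Under this assumed model, a teacher with `beta = 0` answers at random, while a large `beta` almost always picks the higher-utility item. A learner that pays `cost` per query therefore faces the trade-off the summary describes: when to query at all, and whether a more reliable but more expensive teacher is worth it.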