RLHF-based alignment techniques have practical limitations: they are complex to implement, time-consuming, memory-intensive, and unstable during training.
A proposed solution, UNA (Unified Alignment), unifies RLHF/PPO, DPO, and KTO into a single framework and can accommodate different feedback types, including pairwise, binary, and score-based feedback.
UNA recasts alignment as a supervised learning problem: it minimizes the difference between an implicit reward derived from the policy and an explicit reward signal, which simplifies, stabilizes, and speeds up RL fine-tuning while outperforming RLHF/PPO.
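To make that objective concrete, here is a minimal PyTorch sketch of the idea. The function name `una_loss`, the MSE discrepancy, and the random stand-in inputs are illustrative assumptions rather than the paper's exact formulation; what it shows is the core mechanism of regressing the implicit reward, beta * log(pi_theta(y|x) / pi_ref(y|x)), onto an explicit reward.

```python
import torch
import torch.nn.functional as F

def una_loss(policy_logps, ref_logps, explicit_rewards, beta=0.1):
    """Sketch of a UNA-style supervised alignment objective (assumed form).

    The implicit reward implied by the policy,
        r_implicit = beta * log(pi_theta(y|x) / pi_ref(y|x)),
    is pushed toward an explicit reward signal (e.g. from a reward model).

    Args:
        policy_logps: log pi_theta(y|x) summed over response tokens, shape (batch,)
        ref_logps:    log pi_ref(y|x) under the frozen reference model, shape (batch,)
        explicit_rewards: explicit reward for each (x, y) pair, shape (batch,)
        beta: KL-regularization strength inherited from the RLHF objective
    """
    implicit_rewards = beta * (policy_logps - ref_logps)
    # MSE is one natural choice of discrepancy; the paper's loss may differ.
    return F.mse_loss(implicit_rewards, explicit_rewards)

# Toy usage with random stand-in log-probabilities and rewards.
policy_logps = torch.randn(4, requires_grad=True)
ref_logps = torch.randn(4)
explicit_rewards = torch.randn(4)
loss = una_loss(policy_logps, ref_logps, explicit_rewards)
loss.backward()
```

Because this is ordinary supervised regression over offline data, it avoids the sampling loop and value-model machinery of PPO, which is the source of the claimed speed and stability gains.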
In experiments, UNA outperforms DPO, KTO, and RLHF/PPO.