Recent advances in generative AI have been driven by alignment techniques such as reinforcement learning from human feedback (RLHF).
This paper examines the preferences encoded in the datasets used for RLHF and identifies preferences that are common across individuals.
A small set of 21 preference categories captures over 89% of the preference variation across individuals, serving as a canonical basis of human preferences.
The identified preference basis proves useful for both model evaluation and training, offering insights into what alignment and effective fine-tuning require.