Preference learning is crucial for aligning generative models with human expectations.
Existing preference-optimization approaches for diffusion models, such as Diffusion-DPO, suffer from timestep-dependent instability and off-policy bias.
SDPO (Importance-Sampled Direct Preference Optimization) addresses these issues by incorporating importance sampling into the DPO objective, reweighting off-policy samples so that the expected loss better reflects the current policy.
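As an illustration only (the exact SDPO objective is not reproduced here), the sketch below shows one way an importance-sampling correction can be folded into a Diffusion-DPO-style loss. The tensor names, the behavior-policy log-probabilities, and the clipping threshold `max_weight` are assumptions made for this example, not details taken from the method itself.

```python
import torch
import torch.nn.functional as F

def importance_weighted_dpo_loss(
    logp_theta_w, logp_theta_l,   # current-policy log-probs of preferred / dispreferred samples
    logp_ref_w, logp_ref_l,       # frozen reference-model log-probs
    logp_behav_w, logp_behav_l,   # log-probs under the behavior policy that generated the data
    beta=0.1, max_weight=10.0,
):
    """Sketch: Diffusion-DPO-style loss with an importance-sampling correction.

    The preference margin is the standard DPO implicit-reward difference; the
    importance weight pi_theta / pi_behavior reweights off-policy samples so the
    expectation is taken (approximately) under the current policy. Clipping the
    weight is an assumed detail to keep the estimator's variance bounded.
    """
    # Standard DPO implicit-reward margin between preferred and dispreferred samples
    margin = beta * ((logp_theta_w - logp_ref_w) - (logp_theta_l - logp_ref_l))

    # Importance weight w = pi_theta / pi_behavior for the preference pair,
    # detached so it rescales the loss without adding extra gradient terms
    log_w = (logp_theta_w - logp_behav_w) + (logp_theta_l - logp_behav_l)
    weight = torch.exp(log_w).detach().clamp(max=max_weight)

    # Importance-weighted logistic (DPO) loss: softplus(-margin) = -log(sigmoid(margin))
    return (weight * F.softplus(-margin)).mean()
```

Detaching the weight is a common design choice in this kind of sketch: it treats the ratio purely as a reweighting factor rather than a term to be optimized directly.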
Experiments show that SDPO outperforms standard Diffusion-DPO on VBench scores, human preference alignment, and training robustness.