Preference learning is crucial for aligning generative models with human expectations.
Existing preference-optimization approaches for diffusion models, such as Diffusion-DPO, suffer from timestep-dependent instability and off-policy bias.
SDPO (Importance-Sampled Direct Preference Optimization) addresses these issues by incorporating importance sampling into the DPO objective, reweighting off-policy samples so that the expected loss better reflects the current policy.
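As an illustration only (the exact SDPO objective is not reproduced here), the sketch below shows one way an importance-sampling correction can be folded into a Diffusion-DPO-style loss. The tensor names, the behavior-policy log-probabilities, and the clipping threshold `max_weight` are assumptions made for this example, not details taken from the method itself.

```python
import torch
import torch.nn.functional as F

def importance_weighted_dpo_loss(
    logp_theta_w, logp_theta_l,   # current-policy log-probs of preferred / dispreferred samples
    logp_ref_w, logp_ref_l,       # frozen reference-model log-probs
    logp_behav_w, logp_behav_l,   # log-probs under the behavior policy that generated the data
    beta=0.1, max_weight=10.0,
):
    """Sketch: Diffusion-DPO-style loss with an importance-sampling correction.

    The preference margin is the standard DPO implicit-reward difference; the
    importance weight pi_theta / pi_behavior reweights off-policy samples so the
    expectation is taken (approximately) under the current policy. Clipping the
    weight is an assumed detail to keep the estimator's variance bounded.
    """
    # Standard DPO implicit-reward margin between preferred and dispreferred samples
    margin = beta * ((logp_theta_w - logp_ref_w) - (logp_theta_l - logp_ref_l))

    # Importance weight w = pi_theta / pi_behavior for the preference pair,
    # detached so it rescales the loss without adding extra gradient terms
    log_w = (logp_theta_w - logp_behav_w) + (logp_theta_l - logp_behav_l)
    weight = torch.exp(log_w).detach().clamp(max=max_weight)

    # Importance-weighted logistic (DPO) loss: softplus(-margin) = -log(sigmoid(margin))
    return (weight * F.softplus(-margin)).mean()
```

Detaching the weight is a common design choice in this kind of sketch: it treats the ratio purely as a reweighting factor rather than a term to be optimized directly.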
Experiments show that SDPO outperforms standard Diffusion-DPO on VBench scores, human preference alignment, and training robustness.