<ul><li>Human preference alignment can greatly enhance Multimodal Large Language Models (MLLMs), but collecting high-quality preference data is costly.</li><li>The paper proposes a multimodal self-evolution framework in which the model autonomously generates high-quality questions and answers from unannotated images alone.</li><li>The framework combines an image-driven self-questioning mechanism, an answer self-enhancement technique, and an image content alignment loss function.</li><li>Experiments show that the framework performs competitively with methods that rely on external information, offering a more efficient approach to aligning MLLMs.</li></ul>

Beyond Human Data: Aligning Multimodal Large Language Models by Iterative Self-Evolution
