Human preference alignment can greatly enhance Multimodal Large Language Models (MLLMs), but collecting high-quality preference data is costly. A novel multimodal self-evolution framework is proposed that autonomously generates high-quality questions and answers using only unannotated images. The framework incorporates an image-driven self-questioning mechanism, an answer self-enhancement technique, and an image content alignment loss function. Experiments show that the framework performs competitively with methods that rely on external information, offering a more efficient path to MLLM alignment.
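To make the self-evolution loop concrete, the sketch below shows how the three components might fit together: the model questions itself about an unannotated image, drafts an answer, refines that draft, and the (draft, refined) pair becomes preference data. This is a minimal illustration under assumed semantics, not the paper's actual implementation; every function name here (self_question, draft_answer, enhance_answer) is a hypothetical placeholder.

```python
# Hypothetical sketch of one self-evolution round. The model calls are
# stubbed so the script runs standalone; in practice each stub would
# invoke the MLLM being trained. None of these names come from the paper.

from dataclasses import dataclass

@dataclass
class PreferencePair:
    image_id: str
    question: str
    chosen: str    # self-enhanced answer, treated as preferred
    rejected: str  # initial draft answer

def self_question(image_id: str) -> str:
    """Image-driven self-questioning: the MLLM proposes a question
    grounded in the image content (stubbed here)."""
    return f"What objects are visible in {image_id}?"

def draft_answer(image_id: str, question: str) -> str:
    """Initial answer from the current model (stubbed)."""
    return "A draft answer."

def enhance_answer(image_id: str, question: str, draft: str) -> str:
    """Answer self-enhancement: re-answer conditioned on the image and
    the draft, aiming for a better-grounded response (stubbed)."""
    return "A refined, image-grounded answer."

def build_preference_data(image_ids: list[str]) -> list[PreferencePair]:
    """Run one self-evolution pass over a batch of unannotated images."""
    pairs = []
    for img in image_ids:
        question = self_question(img)
        rejected = draft_answer(img, question)
        chosen = enhance_answer(img, question, rejected)
        pairs.append(PreferencePair(img, question, chosen, rejected))
    return pairs

if __name__ == "__main__":
    for pair in build_preference_data(["img_001.jpg", "img_002.jpg"]):
        print(pair)
```

The resulting pairs would then feed a preference-optimization objective; the paper's image content alignment loss would be added at that training stage, but its exact formulation is not given here, so it is omitted from the sketch.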