Diffusion models have been highly successful in text-to-image generation. A large-scale, fully end-to-end diffusion model is proposed for multimodal understanding and generation. The model supports vision-language modeling across a wide range of tasks. Multimodal diffusion modeling of this kind shows potential as an alternative to autoregressive next-token prediction.