Recent vision-language foundation models often produce misalignments in their outputs.A novel approach called CLIP4DM has been proposed to detect dense misalignments between image and text.CLIP4DM revamps the gradient-based attribution computation method to indicate misalignment.CLIP4DM demonstrates state-of-the-art performance and efficiency in detecting misalignments.