RadZero is a similarity-based cross-attention framework designed for vision-language alignment in radiology with zero-shot multi-task capability.
It addresses three challenges in existing approaches: the difficulty of making effective use of complex radiology reports, the reliance on low-resolution images, and the limited interpretability of attention mechanisms.
RadZero leverages large language models to extract semantic sentences from radiology reports and employs a multi-positive contrastive learning strategy to capture relationships between images and textual descriptions.
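The idea behind a multi-positive contrastive objective can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not RadZero's actual loss: the function name `multi_positive_contrastive_loss`, the temperature of 0.07, the image-to-text direction, and the choice to average log-probabilities over each image's positive sentences are all hypothetical.

```python
import numpy as np

def multi_positive_contrastive_loss(image_emb, text_emb, positive_mask, temperature=0.07):
    """Image-to-text contrastive loss where an image may have several positive sentences.

    image_emb:     (N, D) image embeddings
    text_emb:      (M, D) sentence embeddings
    positive_mask: (N, M) boolean, True where sentence j describes image i
    temperature:   softmax temperature (assumed value, not taken from the paper)
    """
    # L2-normalise so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # shape (N, M)

    # Numerically stable log-softmax over all sentences for each image.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Average the log-probability over each image's positives.
    pos_count = positive_mask.sum(axis=1)
    per_image = -(log_prob * positive_mask).sum(axis=1) / np.maximum(pos_count, 1)

    # Images without any positive sentence do not contribute.
    valid = pos_count > 0
    if not valid.any():
        return 0.0
    return float(per_image[valid].mean())


# Toy example: four sentence embeddings, two images with two positives each.
text = np.eye(4)
image = np.stack([text[0] + text[1], text[2] + text[3]])
mask = np.array([[True, True, False, False],
                 [False, False, True, True]])

aligned = multi_positive_contrastive_loss(image, text, mask)
misaligned = multi_positive_contrastive_loss(image, text, ~mask)
print(aligned, misaligned)
```

In the toy example each image embedding is the sum of its two positive sentence embeddings, so the loss with the correct mask is much lower than the loss with the mask inverted.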
Experimental results show that RadZero outperforms state-of-the-art methods in zero-shot classification, grounding, and segmentation, and that it improves explainability in vision-language alignment.