<ul><li>Researchers introduce a joint training framework for embedding products and user queries, aligning uni-modal and multi-modal encoders through contrastive learning on image-text data.</li><li>The framework removes the reliance on engagement history and trains the query encoder on an LLM-curated relevance dataset.</li><li>The resulting embeddings generalize well and improve performance in downstream applications such as product categorization and relevance prediction.</li><li>Deploying the framework yields a significant uplift in click-through rate and conversion rate for personalized ad recommendations.</li></ul>
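
The contrastive alignment mentioned in the summary can be illustrated with a minimal sketch. The summary does not state the exact objective, so the symmetric cross-entropy loss below (in the style commonly used for image-text contrastive training) is an assumption, as are the function names, the temperature value, and the toy vectors that stand in for encoder outputs.

```python
import math
import random


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(queries, products, temperature=0.07):
    """Assumed objective: pull matching (query, product) pairs together
    and push all other pairs in the batch apart, in both directions."""
    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in products]
    logits = [
        [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        for qi in q
    ]
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy data: random vectors stand in for encoder outputs (hypothetical).
rng = random.Random(0)
products = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
aligned_queries = [[x + 0.01 * rng.gauss(0, 1) for x in row] for row in products]
random_queries = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]

aligned_loss = symmetric_contrastive_loss(aligned_queries, products)
random_loss = symmetric_contrastive_loss(random_queries, products)
print(f"aligned: {aligned_loss:.4f}  random: {random_loss:.4f}")
```

Query embeddings that sit close to their matching product embeddings give a much lower loss than unrelated ones, which is the signal a contrastive objective of this kind would use to align the two encoders.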

DashCLIP: Leveraging multimodal models for generating semantic embeddings for DoorDash
