Researchers introduce a framework that jointly trains embeddings for products and user queries, aligning uni-modal and multi-modal encoders through contrastive learning on image-text data.
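The summary does not include implementation details, but the alignment it describes follows the standard CLIP-style symmetric InfoNCE recipe. Below is a minimal sketch under that assumption; the encoder architectures, dimensions, temperature, and batch construction are illustrative, not taken from the paper.

```python
# Sketch of contrastive (InfoNCE) alignment between a uni-modal text
# encoder and an image encoder on paired image-text data.
# All architectures and hyperparameters here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    """Toy uni-modal text tower: embedding + mean pooling + projection."""
    def __init__(self, vocab_size=30522, dim=256, embed_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, embed_dim)

    def forward(self, token_ids):                    # (B, T)
        pooled = self.embed(token_ids).mean(dim=1)   # (B, dim)
        return F.normalize(self.proj(pooled), dim=-1)

class ImageEncoder(nn.Module):
    """Toy image tower: small conv backbone + global pooling + projection."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, images):                       # (B, 3, H, W)
        return F.normalize(self.proj(self.backbone(images)), dim=-1)

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs are positives,
    every other in-batch pair is a negative."""
    logits = text_emb @ image_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))           # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# One illustrative training step on a random batch.
text_enc, image_enc = TextEncoder(), ImageEncoder()
opt = torch.optim.AdamW(
    list(text_enc.parameters()) + list(image_enc.parameters()), lr=1e-4)
tokens = torch.randint(0, 30522, (8, 16))  # 8 text descriptions, 16 tokens each
images = torch.randn(8, 3, 64, 64)         # 8 matching product images
loss = contrastive_loss(text_enc(tokens), image_enc(images))
loss.backward()
opt.step()
```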
The framework eliminates reliance on engagement history by instead training the query encoder on an LLM-curated relevance dataset.
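The summary leaves the curation and training procedure unspecified. One plausible sketch is fine-tuning the query encoder on (query, product, relevance) triples labeled by an LLM; the data format, binary labels, and cosine scoring head below are all assumptions for illustration.

```python
# Hedged sketch: training a query encoder on LLM-judged relevance labels
# instead of engagement signals. Dataset fields and the loss formulation
# are hypothetical, not the paper's recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryEncoder(nn.Module):
    """Toy query tower mirroring the text encoder above."""
    def __init__(self, vocab_size=30522, dim=256, embed_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, embed_dim)

    def forward(self, token_ids):
        pooled = self.embed(token_ids).mean(dim=1)
        return F.normalize(self.proj(pooled), dim=-1)

# Hypothetical LLM-curated batch: query tokens, a product embedding from a
# frozen product tower, and label = 1.0 if the LLM judged the pair relevant.
query_tokens = torch.randint(0, 30522, (8, 12))
product_embs = F.normalize(torch.randn(8, 128), dim=-1)
labels = torch.randint(0, 2, (8,)).float()

encoder = QueryEncoder()
opt = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

# Score each pair by cosine similarity, then train with binary
# cross-entropy against the LLM relevance judgments.
scores = (encoder(query_tokens) * product_embs).sum(dim=-1)  # cosine in [-1, 1]
loss = F.binary_cross_entropy_with_logits(scores / 0.07, labels)
loss.backward()
opt.step()
```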
The resulting embeddings generalize well and improve performance on downstream applications such as product categorization and relevance prediction.
Deploying the framework yields significant uplifts in click-through rate and conversion rate for personalized ads recommendation.