Recent advances in large language models (LLMs) have made it possible to augment information retrieval (IR) pipelines with synthetic data.
The traditional contrastive training paradigm, which pairs binary relevance labels with the InfoNCE loss, treats every document that is not explicitly annotated as relevant identically, regardless of its actual degree of relevance.
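To make this limitation concrete, the following is a minimal sketch of the standard InfoNCE objective over cosine similarities; the function name, the NumPy-based formulation, and the temperature value are illustrative assumptions, not the paper's implementation. Note that every non-annotated document enters the denominator with the same weight, whether it is truly irrelevant or merely unlabeled.

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss under binary relevance (hypothetical sketch).

    One annotated positive; all other documents are treated as equally
    irrelevant negatives, regardless of their actual degree of relevance.
    """
    # Stack the positive (index 0) and all negatives: shape (1 + N, d).
    docs = np.vstack([positive] + list(negatives))
    # Cosine similarity between the query and each document.
    sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    logits = sims / temperature
    # Softmax cross-entropy with the positive at index 0: every negative
    # contributes to the denominator with identical weight.
    log_prob_positive = logits[0] - np.log(np.sum(np.exp(logits)))
    return -log_prob_positive
```

Because the loss only distinguishes the single positive from everything else, a highly relevant but unlabeled document is penalized exactly as hard as a random one, which is the gap that a graduated-relevance ranking context aims to close.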
In this work, synthetic documents generated by open-source LLMs are used to create a fully synthetic ranking context of graduated relevance for training dense retrievers.
Experiments show that this approach outperforms conventional training and performs comparably to retrievers trained on real, labeled documents.