IR2, Information Regularization for Information Retrieval, is a technique for reducing overfitting during synthetic data generation in settings with limited training data.
Experimental results indicate that IR2 outperforms previous synthetic query generation methods and reduces cost by up to 50%.
The work explores three regularization methods applied at different stages of the query synthesis pipeline, each offering varying degrees of performance improvement.
The code, prompts, and synthetic data for IR2 are available on GitHub.