Contamination in machine learning refers to test data leaking into the training set, a problem that compromises the evaluation of Large Language Models (LLMs) trained on large, opaque text corpora.
Because these models are trained on text scraped from the web, where benchmark items may appear verbatim, tools to detect contamination are crucial for fairly tracking the evolution of LLM performance.
Previous studies have proposed methods to quantify contamination in short text sequences, but these methods suffer from limitations that make them impractical in many realistic settings.
LogProber is introduced as an efficient algorithm that detects contamination in a black-box setting, using token probabilities to probe the model's familiarity with the question text rather than with the answer.
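To make the mechanism concrete, below is a minimal sketch of the kind of quantity such a method operates on: the per-token log-probabilities of a question under the model. The model choice (`gpt2`) and the mean-log-likelihood summary statistic are illustrative assumptions for this sketch, not LogProber's exact scoring procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; the approach only requires access to
# per-token probabilities, not model weights or training data.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def question_logprobs(question: str) -> list[float]:
    """Log-probability of each question token under the model."""
    ids = tokenizer(question, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # The probability of token t is read from the distribution the
    # model predicts at position t-1 (the first token is not scored).
    logps = torch.log_softmax(logits[0, :-1], dim=-1)
    return logps.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1).tolist()

# Intuition: a question seen verbatim during training tends to be
# assigned unusually high (less negative) per-token log-probabilities
# compared with genuinely novel text.
scores = question_logprobs("How many legs does a spider have?")
print(sum(scores) / len(scores))  # mean per-token log-likelihood
```

Note that this sketch scores only the question text, consistent with the black-box focus on question familiarity: no answer, training data, or model internals are needed.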
Beyond addressing the drawbacks of existing methods, LogProber highlights how the design of a detection algorithm determines which forms of contamination it can and cannot identify.