techminis

A naukri.com initiative

Image Credit: UX Design

The problems with running human evals

  • Human evaluations are crucial for determining an AI model's value, safety, and alignment with user needs beyond what automated metrics capture.
  • Ambiguous results often stem from low agreement among raters; inter-rater reliability (IRR) quantifies this agreement and should be measured explicitly.
  • Contradictory results within the same evaluation task, or between raters and actual users, can indicate flaws in the evaluation design or a mismatch with product outcomes.
  • Debugging evaluation problems requires aligning the evaluation with product goals, giving raters clear instructions, and ensuring user preferences are accurately represented.
  • Dry runs within the team, automated evaluations for suitable tasks, and combining human and automated ratings can all improve the evaluation process.
  • Iterating quickly and simply starting the evaluation process is essential for surfacing issues that may be unique to the product context.
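The IRR point above can be made concrete. A minimal sketch of Cohen's kappa, one common way to measure agreement between two raters beyond chance (the rating labels and data here are invented for illustration):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: probability the raters agree by chance,
    # based on each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Two hypothetical raters judging six model responses.
a = ["good", "good", "bad", "good", "bad", "good"]
b = ["good", "bad", "bad", "good", "bad", "bad"]
print(round(cohens_kappa(a, b), 2))  # → 0.4
```

A kappa near 1 indicates strong agreement, near 0 indicates chance-level agreement; low values are a signal to revisit the task definition and rater instructions before trusting the evaluation results.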
