Large language models (LLMs) have demonstrated great potential for automating the evaluation of natural language generation.
Previous LLM-as-a-judge frameworks fall short in two ways: they either rely on a zero-shot setting without consulting any human input, which leads to low alignment, or fine-tune LLMs on labeled data, which requires a non-trivial number of samples.
In this paper, the authors propose HypoEval, a Hypothesis-guided Evaluation framework, which incorporates a checklist-like approach that combines the LLM's assigned scores on each decomposed dimension into an overall score.
With only 30 human evaluations, HypoEval achieves state-of-the-art performance in alignment with both human rankings and human scores, outperforming previous methods.
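To make the checklist-like aggregation concrete, the sketch below scores a text on several decomposed dimensions with an LLM judge and combines them into an overall score. The function names, prompt wording, and the unweighted averaging step are illustrative assumptions for exposition, not HypoEval's exact pipeline (which first derives rubrics from a small corpus of human evaluations).

```python
# Illustrative sketch: per-dimension LLM scoring combined into an overall score.
# All names and the averaging rule are assumptions, not the authors' implementation.
from typing import Callable, Dict, List


def score_dimension(llm: Callable[[str], str], text: str, rubric: str) -> float:
    """Ask the LLM judge to rate `text` on one decomposed dimension (1-5)."""
    prompt = (
        f"Rubric for this dimension:\n{rubric}\n\n"
        f"Text to evaluate:\n{text}\n\n"
        "Return a single integer score from 1 to 5."
    )
    return float(llm(prompt).strip())


def overall_score(llm: Callable[[str], str], text: str, rubrics: List[str]) -> Dict[str, float]:
    """Score every dimension, then combine them (here: unweighted mean) into an overall score."""
    per_dim = {f"dim_{i}": score_dimension(llm, text, r) for i, r in enumerate(rubrics)}
    per_dim["overall"] = sum(per_dim.values()) / len(per_dim)
    return per_dim


if __name__ == "__main__":
    # Dummy judge for demonstration; replace with a real LLM call in practice.
    dummy_llm = lambda prompt: "4"
    rubrics = [
        "Coherence: ideas follow logically from one another.",
        "Fluency: grammatical, natural phrasing.",
    ]
    print(overall_score(dummy_llm, "An example generated summary.", rubrics))
```

In this reading, the small set of human evaluations informs the rubrics themselves, while the per-dimension scores are aggregated deterministically rather than requiring any fine-tuning.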