Large language models (LLMs) have demonstrated great potential for automating the evaluation of natural language generation.
Previous LLM-as-a-judge frameworks fall short in two ways: they either rely on a zero-shot setting without consulting any human input, which leads to low alignment, or fine-tune LLMs on labeled data, which requires a non-trivial number of samples.
In this paper, the authors propose HypoEval, a Hypothesis-guided Evaluation framework, which incorporates a checklist-like approach that combines the LLM's assigned scores on each decomposed dimension into an overall score.
With only 30 human evaluations, HypoEval achieves state-of-the-art performance in alignment with both human rankings and human scores, outperforming previous methods.
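To make the checklist-like aggregation concrete, the sketch below scores a text on several decomposed dimensions with an LLM judge and combines them into an overall score. The function names, prompt wording, and the unweighted averaging step are illustrative assumptions for exposition, not HypoEval's exact pipeline (which first derives rubrics from a small corpus of human evaluations).

```python
# Illustrative sketch: per-dimension LLM scoring combined into an overall score.
# All names and the averaging rule are assumptions, not the authors' implementation.
from typing import Callable, Dict, List


def score_dimension(llm: Callable[[str], str], text: str, rubric: str) -> float:
    """Ask the LLM judge to rate `text` on one decomposed dimension (1-5)."""
    prompt = (
        f"Rubric for this dimension:\n{rubric}\n\n"
        f"Text to evaluate:\n{text}\n\n"
        "Return a single integer score from 1 to 5."
    )
    return float(llm(prompt).strip())


def overall_score(llm: Callable[[str], str], text: str, rubrics: List[str]) -> Dict[str, float]:
    """Score every dimension, then combine them (here: unweighted mean) into an overall score."""
    per_dim = {f"dim_{i}": score_dimension(llm, text, r) for i, r in enumerate(rubrics)}
    per_dim["overall"] = sum(per_dim.values()) / len(per_dim)
    return per_dim


if __name__ == "__main__":
    # Dummy judge for demonstration; replace with a real LLM call in practice.
    dummy_llm = lambda prompt: "4"
    rubrics = [
        "Coherence: ideas follow logically from one another.",
        "Fluency: grammatical, natural phrasing.",
    ]
    print(overall_score(dummy_llm, "An example generated summary.", rubrics))
```

In this reading, the small set of human evaluations informs the rubrics themselves, while the per-dimension scores are aggregated deterministically rather than requiring any fine-tuning.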