<ul data-eligibleForWebStory="false"><li>A study introduces EvalTree, a method that identifies weaknesses in language models (LMs) by constructing a capability tree and pinpointing the nodes where a model underperforms.</li><li>EvalTree outperforms baseline weakness-profiling methods, identifying weaknesses more precisely and comprehensively on benchmarks such as MATH and WildChat.</li><li>The weakness profiles produced by EvalTree enable targeted data collection, which improves LM performance more than other data collection strategies.</li><li>EvalTree also reveals shortcomings in Chatbot Arena's human-voter-based evaluation process, and an accompanying tool lets practitioners explore capability trees interactively.</li></ul>
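
The idea of a capability tree with underperforming nodes can be illustrated with a small Python sketch. This is not the study's implementation: the `CapabilityNode` class, the `find_weak_nodes` function, the fixed accuracy threshold, and the minimum instance count are all hypothetical choices made for illustration, and the summary does not say how EvalTree actually builds the tree or decides that a node is weak.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CapabilityNode:
    """Hypothetical tree node: a capability plus per-instance outcomes."""
    name: str
    # True/False correctness of benchmark instances attached directly to this node
    results: List[bool] = field(default_factory=list)
    children: List["CapabilityNode"] = field(default_factory=list)

    def all_results(self) -> List[bool]:
        """Collect outcomes from this node and every descendant."""
        out = list(self.results)
        for child in self.children:
            out.extend(child.all_results())
        return out


def find_weak_nodes(node: CapabilityNode,
                    threshold: float = 0.5,
                    min_instances: int = 3) -> List[str]:
    """Return names of nodes whose accuracy is below `threshold`.

    Both `threshold` and `min_instances` are illustrative assumptions,
    not criteria taken from the study.
    """
    weak = []
    results = node.all_results()
    if len(results) >= min_instances:
        accuracy = sum(results) / len(results)
        if accuracy < threshold:
            weak.append(node.name)
    for child in node.children:
        weak.extend(find_weak_nodes(child, threshold, min_instances))
    return weak


# Toy data, invented for this example
tree = CapabilityNode("math", children=[
    CapabilityNode("algebra", results=[True, True, True, False]),
    CapabilityNode("geometry", results=[False, False, True, False]),
])
weak = find_weak_nodes(tree)
print(weak)
```

With this toy data only the `geometry` node falls below the threshold. A flagged node of this kind is what would guide the targeted data collection described in the summary.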

EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees
