A study introduces EvalTree, a method that identifies weaknesses in language models (LMs) by constructing a capability tree and pinpointing the nodes where a model underperforms.
EvalTree outperforms baseline weakness profiling methods, identifying weaknesses more precisely and comprehensively on benchmarks such as MATH and WildChat.
The weakness profiles produced by EvalTree enable targeted data collection, which improves LM performance more than other data collection strategies.
EvalTree also reveals shortcomings in Chatbot Arena's human-voter-based evaluation process, and the study provides a tool that lets practitioners explore capability trees interactively.
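
To make the idea of a capability tree with underperforming nodes concrete, here is a minimal Python sketch. It is not the study's implementation: the `CapabilityNode` class, the `find_weak_nodes` function, the accuracy threshold, the minimum instance count, and the toy data are all assumptions chosen for illustration. The summary above does not say how nodes are scored or selected.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CapabilityNode:
    """Hypothetical tree node: a capability description plus per-instance results."""
    description: str
    # Correctness of the benchmark instances attached directly to this node.
    results: List[bool] = field(default_factory=list)
    children: List["CapabilityNode"] = field(default_factory=list)

    def all_results(self) -> List[bool]:
        """Collect results from this node and every node in its subtree."""
        collected = list(self.results)
        for child in self.children:
            collected.extend(child.all_results())
        return collected


def find_weak_nodes(
    node: CapabilityNode,
    threshold: float = 0.5,   # assumed cutoff, not taken from the study
    min_instances: int = 3,   # assumed minimum subtree size
) -> List[Tuple[str, float]]:
    """Return (description, accuracy) for the highest nodes whose subtree
    accuracy is below `threshold`; flagged subtrees are not searched further."""
    results = node.all_results()
    if len(results) >= min_instances:
        accuracy = sum(results) / len(results)
        if accuracy < threshold:
            return [(node.description, accuracy)]
    weak: List[Tuple[str, float]] = []
    for child in node.children:
        weak.extend(find_weak_nodes(child, threshold, min_instances))
    return weak


# Toy data for illustration only; these are not results from the study.
tree = CapabilityNode(
    "Mathematical reasoning",
    children=[
        CapabilityNode("Algebraic manipulation", results=[True, True, True, False]),
        CapabilityNode("Geometric reasoning", results=[False, False, True, False]),
    ],
)

weak_nodes = find_weak_nodes(tree)
print(weak_nodes)
```

In this toy tree the root's overall accuracy sits exactly at the assumed threshold, so the search descends and flags only the weaker child. A fixed threshold is the simplest possible selection rule and stands in for whatever criterion the method actually uses.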