<ul><li>BenchHub is a dynamic benchmark repository for evaluating large language models (LLMs).</li><li>It aggregates and classifies benchmark datasets from diverse domains, integrating 303K questions across 38 benchmarks.</li><li>BenchHub is designed for continuous updates and scalable data management, enabling flexible, customizable evaluation tailored to specific domains or use cases.</li><li>Extensive experiments with various LLM families show that model performance varies significantly across domain-specific subsets, highlighting the importance of domain-aware benchmarking.</li></ul>
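
The idea of domain-aware evaluation in the summary above can be illustrated with a minimal sketch. It assumes each question carries a domain label and a per-model correctness flag. The record layout, the `RECORDS` sample, and the helpers `filter_by_domain` and `accuracy_by_domain` are all hypothetical and are not BenchHub's actual schema or API.

```python
from collections import defaultdict

# Hypothetical toy records, NOT BenchHub's real schema or data: each entry
# is one question tagged with its source benchmark, a domain label, and
# whether some model answered it correctly.
RECORDS = [
    {"benchmark": "bench_a", "domain": "math", "correct": True},
    {"benchmark": "bench_a", "domain": "math", "correct": False},
    {"benchmark": "bench_b", "domain": "law", "correct": True},
    {"benchmark": "bench_b", "domain": "law", "correct": True},
    {"benchmark": "bench_c", "domain": "math", "correct": True},
]


def filter_by_domain(records, domain):
    """Return only the questions labeled with the given domain."""
    return [r for r in records if r["domain"] == domain]


def accuracy_by_domain(records):
    """Compute accuracy separately for each domain label."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["domain"]] += 1
        hits[r["domain"]] += int(r["correct"])
    return {d: hits[d] / totals[d] for d in totals}
```

Pooling all five toy questions gives a single accuracy of 0.8, while the per-domain view separates a weaker math score from a perfect law score, which is the kind of gap an aggregate number can hide.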

BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation
