BenchHub is a dynamic benchmark repository introduced to support effective evaluation of large language models (LLMs).
It aggregates and classifies benchmark datasets from diverse domains, integrating 303K questions across 38 benchmarks.
BenchHub is designed for continuous updates and scalable data management, enabling flexible and customizable evaluation tailored to specific domains or use cases.
Extensive experiments with various LLM families show that model performance varies significantly across domain-specific subsets, underscoring the importance of domain-aware benchmarking.
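To illustrate the kind of domain-tailored evaluation described above, the following is a minimal sketch of constructing a domain-specific subset from an aggregated benchmark collection. The dataset identifier "BenchHub/BenchHub-En" and the column names ("subject", "prompt", "answer") are assumptions for illustration, not the confirmed BenchHub schema or API.

```python
# Sketch: build a domain-filtered evaluation subset from an aggregated
# benchmark collection hosted on the Hugging Face Hub.
# NOTE: the repo id "BenchHub/BenchHub-En" and the "subject" column
# are hypothetical placeholders used only for illustration.
from datasets import load_dataset


def build_domain_subset(domain_keyword: str, split: str = "train"):
    """Return only the questions whose subject field mentions the given domain."""
    ds = load_dataset("BenchHub/BenchHub-En", split=split)  # hypothetical dataset id
    # Keep rows whose (assumed) subject label contains the requested domain keyword.
    return ds.filter(
        lambda row: domain_keyword.lower() in str(row.get("subject", "")).lower()
    )


if __name__ == "__main__":
    science_subset = build_domain_subset("science")
    print(f"Selected {len(science_subset)} science questions for evaluation")
```

Such a filter step is one plausible way to realize domain-aware benchmarking: the same aggregated pool of questions can be sliced into per-domain evaluation sets, making performance differences across domains directly measurable.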