OpenAI has introduced HealthBench, an open-source benchmark for evaluating large language models (LLMs) in realistic healthcare scenarios, developed in collaboration with 262 physicians across various medical specialties.
HealthBench addresses limitations of existing benchmarks, such as weak real-world applicability, limited expert validation, and shallow diagnostic coverage, by evaluating models on multi-turn conversations scored against physician-validated rubrics.
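To make the rubric-based setup concrete, here is a minimal sketch of how scoring a single model response against a physician-written rubric could work. The criterion texts, point values, and the convention of negative points for undesirable behaviors are illustrative assumptions, not the actual HealthBench implementation.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str
    points: int  # positive for desired behaviors, negative for undesired ones (assumed)

def score_response(criteria: list[RubricCriterion], met: list[bool]) -> float:
    """Score one model response against a rubric.

    `met[i]` is an automated grader's judgment of whether criterion i
    applies to the response. The score is awarded points divided by the
    maximum achievable (sum of positive points), clipped to [0, 1].
    """
    awarded = sum(c.points for c, m in zip(criteria, met) if m)
    max_points = sum(c.points for c in criteria if c.points > 0)
    return max(0.0, min(1.0, awarded / max_points))

# Hypothetical rubric for an emergency-referral conversation
rubric = [
    RubricCriterion("Advises seeking emergency care for red-flag symptoms", 10),
    RubricCriterion("Asks a clarifying question about symptom duration", 5),
    RubricCriterion("States a specific diagnosis with unwarranted certainty", -5),
]
print(score_response(rubric, [True, False, True]))  # (10 - 5) / 15 ≈ 0.333
```

Normalizing by the maximum achievable points lets rubrics of different sizes produce comparable scores across conversations.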
It organizes evaluations across seven key themes and introduces subsets such as HealthBench Consensus and HealthBench Hard, providing granular insight into model capabilities, remaining challenges, and progress across model generations.
The framework also assesses model consistency and meta-evaluates its automated graders, aiming to offer a more nuanced understanding of AI model behavior in healthcare applications.
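One common way to meta-evaluate an automated grader is to compare its per-criterion met/unmet judgments against physician labels. The sketch below uses macro-averaged F1 over the two classes; the metric choice and the labels are assumptions for illustration, not HealthBench's published procedure.

```python
def macro_f1(physician: list[bool], grader: list[bool]) -> float:
    """Agreement between an automated grader and physician labels on
    per-criterion met/unmet judgments, as macro-averaged F1 over the
    'met' and 'unmet' classes."""
    def f1(positive: bool) -> float:
        tp = sum(p == g == positive for p, g in zip(physician, grader))
        fp = sum(g == positive and p != positive for p, g in zip(physician, grader))
        fn = sum(p == positive and g != positive for p, g in zip(physician, grader))
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)
    return (f1(True) + f1(False)) / 2

# Hypothetical judgments over five rubric criteria
physician = [True, True, False, False, True]
grader    = [True, False, False, False, True]
print(round(macro_f1(physician, grader), 3))  # 0.8
```

Macro averaging weights the rarer class equally, so a grader cannot score well simply by predicting the majority label.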