<ul><li>Cybench is a framework for evaluating the cybersecurity capabilities and risks of language model agents, focusing on their ability to autonomously identify vulnerabilities and execute exploits.</li><li>It includes 40 professional-level Capture the Flag (CTF) tasks drawn from 4 distinct CTF competitions and spanning a wide range of difficulties.</li><li>In evaluations of several language models, including GPT-4o and Claude 3.5 Sonnet, unguided agents solved only the easier tasks, those that took human teams up to 11 minutes to complete.</li><li>The framework and all related code and data are publicly available at https://cybench.github.io.</li></ul>

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
