Cybench is a framework for evaluating the cybersecurity capabilities and risks of language model agents, focusing on their ability to autonomously identify vulnerabilities and execute exploits.
It includes 40 professional-level Capture the Flag (CTF) tasks drawn from 4 distinct competitions and spanning a wide range of difficulties.
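An evaluation of several language models, including GPT-4o and Claude 3.5 Sonnet, found that, without subtask guidance, agents solved only complete tasks that took human teams up to 11 minutes to solve; by comparison, the most difficult task took human teams 24 hours and 54 minutes.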
The framework and all related code and data are publicly available at https://cybench.github.io.