
Excited to share VADER, AfterQuery's new, human-evaluated benchmark for LLMs on real-world vulnerability handling!
We curated 174 vulnerabilities from open-source repos across 15 programming languages, evaluating models across four stages that mirror a security engineer's workflow:
(1) Detect & classify vulnerabilities (CWE)
(2) Explain root causes & impact
(3) Generate working patches
(4) Create test plans for validation

The benchmark emphasizes real complexity: 75% are multi-language cases, and High/Critical severity vulnerabilities dominate the dataset.
All 174 vulnerability cases were submitted by experienced cybersecurity experts and double-checked by an independent reviewer, ensuring reliable labels and patches for evaluation.

Using one-shot prompting, we found that even state-of-the-art models such as OpenAI's o3 achieve only ~55% accuracy, showing significant room for improvement in AI-assisted security.
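For a sense of what a one-shot, four-stage evaluation could look like, here is a minimal hypothetical sketch. It is not the paper's actual harness; the prompt wording, model identifier, and the toy code snippet are all assumptions for illustration only.

```python
# Hypothetical sketch of a one-shot, four-stage prompt for a single case.
# NOT the official VADER harness; prompt text and model name are assumptions.
from openai import OpenAI

client = OpenAI()

def evaluate_case(code_snippet: str, model: str = "o3") -> str:
    prompt = (
        "You are acting as a security engineer. For the code below:\n"
        "1. Detect and classify any vulnerability (give the CWE ID).\n"
        "2. Explain the root cause and its impact.\n"
        "3. Generate a working patch.\n"
        "4. Provide a test plan that validates the patch.\n\n"
        f"Code:\n{code_snippet}\n"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example usage (toy placeholder snippet, not an actual VADER case):
print(evaluate_case('query = f"SELECT * FROM users WHERE id = {user_id}"'))
```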
Huge thanks to Austin Wang, Spencer Mateega, Carlos Georgescu, Danny Tang, and the AfterQuery team for making this possible!
Paper: https://arxiv.org/abs/2505.19395
All data, evaluation tools & results are open-sourced at: https://github.com/AfterQuery/vader