
Excited to share VADER, AfterQuery's new, human-evaluated benchmark for LLMs on real-world vulnerability handling!
We curated 174 vulnerabilities from open-source repos across 15 programming languages, evaluating models across four stages that mirror a security engineer's workflow:
(1) Detect & classify vulnerabilities (CWE)
(2) Explain root causes & impact
(3) Generate working patches
(4) Create test plans for validation

The benchmark emphasizes real complexity: 75% are multi-language cases, and High/Critical severity vulnerabilities dominate the dataset.
All 174 vulnerability cases were submitted by experienced cybersecurity experts and double-checked by an independent reviewer, ensuring reliable labels and patches for evaluation.

Using one-shot prompting, we found that even state-of-the-art models such as OpenAI's o3 achieve only ~55% accuracy, showing significant room for improvement in AI-assisted security.
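For a sense of what a one-shot, four-stage evaluation could look like, here is a minimal hypothetical sketch. It is not the paper's actual harness; the prompt wording, model identifier, and the toy code snippet are all assumptions for illustration only.

```python
# Hypothetical sketch of a one-shot, four-stage prompt for a single case.
# NOT the official VADER harness; prompt text and model name are assumptions.
from openai import OpenAI

client = OpenAI()

def evaluate_case(code_snippet: str, model: str = "o3") -> str:
    prompt = (
        "You are acting as a security engineer. For the code below:\n"
        "1. Detect and classify any vulnerability (give the CWE ID).\n"
        "2. Explain the root cause and its impact.\n"
        "3. Generate a working patch.\n"
        "4. Provide a test plan that validates the patch.\n\n"
        f"Code:\n{code_snippet}\n"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example usage (toy placeholder snippet, not an actual VADER case):
print(evaluate_case('query = f"SELECT * FROM users WHERE id = {user_id}"'))
```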
Huge thanks to Austin Wang, Spencer Mateega, Carlos Georgescu, Danny Tang, and the AfterQuery team for making this possible!
Paper: https://arxiv.org/abs/2505.19395
All data, evaluation tools & results are open-sourced at: https://github.com/AfterQuery/vader