Topic analysis

CVE-Bench: testing LLM agents on real-world vulnerability patches

CVE-Bench is a benchmark that tests LLM agents (e.g., Anthropic’s Mythos, Poolside’s Laguna, and OpenAI models) on patching 20 real-world CVEs across 18 Python projects. Each agent works in a sandboxed container under one of three prompt conditions (advisory, diagnose, locate). The benchmark measures solve rate, token usage, tool calls, and regressions, and it surfaces performance gaps between models, failure modes such as wrong-search drift and partial fixes, and open challenges in benchmarking itself. Its stated aim is to help get security vulnerabilities fixed before they are exploited, and its data and tools are open to the community.
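As a rough illustration of how the metrics named above could be aggregated, here is a minimal sketch that groups runs by model and prompt condition. Everything in it is an assumption made for illustration: the `Run` record, its field names, the `summarize` helper, and the placeholder rows are hypothetical and are not taken from CVE-Bench's actual data format or results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Run:
    """One agent attempt at one CVE (hypothetical record layout)."""
    model: str
    cve_id: str
    condition: str   # "advisory", "diagnose", or "locate"
    solved: bool     # did the patch fix the vulnerability?
    regressed: bool  # did the patch break existing tests?
    tokens: int
    tool_calls: int


def summarize(runs):
    """Aggregate per (model, condition): rates and mean costs."""
    groups = defaultdict(list)
    for run in runs:
        groups[(run.model, run.condition)].append(run)

    summary = {}
    for key, group in groups.items():
        n = len(group)
        summary[key] = {
            "runs": n,
            "solve_rate": sum(r.solved for r in group) / n,
            "regression_rate": sum(r.regressed for r in group) / n,
            "mean_tokens": sum(r.tokens for r in group) / n,
            "mean_tool_calls": sum(r.tool_calls for r in group) / n,
        }
    return summary


# Placeholder rows only; these are NOT benchmark results.
runs = [
    Run("model-a", "CVE-XXXX-0001", "advisory", True, False, 1000, 10),
    Run("model-a", "CVE-XXXX-0002", "advisory", False, False, 3000, 30),
    Run("model-a", "CVE-XXXX-0001", "locate", True, True, 500, 4),
]
summary = summarize(runs)
for (model, condition), stats in sorted(summary.items()):
    print(model, condition, stats)
```

Keeping the regression rate next to the solve rate lets a patch that fixes the vulnerability but breaks existing behavior show up in the same table as the headline number.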

Heat score: 1
Sources: 1
Platforms: 1
Relations: 0
First seen: May 30, 2026, 3:28 AM
Last updated: May 30, 2026, 4:50 AM

Why this topic matters

This topic is currently shaped by signals from 1 source platform. This page organizes the AI analysis summary, 1 timeline event, and 0 relationship edges so that search engines and AI systems can understand the topic's factual basis and how it has spread.

News

Keywords

22 tags
CVE-Bench, LLM agents, security vulnerabilities, CVEs, benchmarking, AI security, code fixing, solve rate, prompt conditions, Docker, Python projects, OpenAI, Poolside, Anthropic, Mythos, Laguna models, tool calls, token cost, regression tests, failure modes, CVSS, CWE

Source evidence

1 evidence item

CVE-Bench: testing LLM agents on real-world vulnerability patches

News · 1
May 30, 2026, 3:28 AM

Timeline

CVE-Bench: testing LLM agents on real-world vulnerability patches

May 30, 2026, 3:28 AM

Related topics

No related topics have been aggregated yet, but this page still preserves the AI summary, source links, and timeline.