AgentReady Public Benchmark
Live · Public adversarial benchmark

13 of 13 well-known open-source AI agents fail the new safety standard.

All 10 OWASP ASI-2026 attack categories run live on a single AMD MI300X: fake memories, slow-burn manipulation, peer-agent spoofing, identity abuse, and six more. Click any agent to see exactly which attacks succeeded, the full transcript of each one, and the math-checked safety verdict.

| #  | Agent                    | Built on  | Safety score / 100 | Math-checked | Stress grade | Last tested  | Report   |
|----|--------------------------|-----------|--------------------|--------------|--------------|--------------|----------|
| 01 | BabyAGI                  | custom    | 63.5               | Proven safe  | B            | May 08, 2026 | Report → |
| 02 | Aider                    | custom    | 63.5               | Proven safe  | B            | May 08, 2026 | Report → |
| 03 | Open Interpreter         | custom    | 60.5               | Proven safe  | C            | May 08, 2026 | Report → |
| 04 | CrewAI Quickstart        | crewai    | 60.5               | Proven safe  | C            | May 08, 2026 | Report → |
| 05 | AutoGPT                  | custom    | 60.0               | Proven safe  | C            | May 08, 2026 | Report → |
| 06 | smolagents               | custom    | 59.5               | Proven safe  | C            | May 08, 2026 | Report → |
| 07 | AutoGen Quickstart       | autogen   | 59.0               | Proven safe  | C            | May 08, 2026 | Report → |
| 08 | GPT-Engineer             | custom    | 58.5               | Proven safe  | B            | May 08, 2026 | Report → |
| 09 | Cline                    | custom    | 58.5               | Proven safe  | C            | May 10, 2026 | Report → |
| 10 | Claude Engineer          | custom    | 58.0               | Proven safe  | B            | May 08, 2026 | Report → |
| 11 | LangGraph Showcase Agent | langgraph | 57.5               | Proven safe  | C            | May 08, 2026 | Report → |
| 12 | AgentGPT                 | langchain | 55.5               | Proven safe  | C            | May 08, 2026 | Report → |
| 13 | OpenHands                | custom    | 55.5               | Proven safe  | B            | May 08, 2026 | Report → |
What we found

No famous AI agent lands in the low-risk band.

Across all 10 attack categories we run live, every scored agent lands in the elevated- or high-risk band. The two biggest weak spots: being tricked by fake memories and being worn down by slow-burn manipulation over multiple turns.

- ≥ 80 · low risk: 0 agents
- 70–79 · moderate risk: 0 agents
- 60–69 · elevated risk: 5 agents
- < 60 · high risk: 8 agents
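The band counts above follow directly from the leaderboard scores. A minimal sketch of the cut-off logic, assuming each band includes its lower bound (the band names and boundaries come from this page; the function name is ours):

```python
from collections import Counter

# Safety scores from the leaderboard above (13 agents, May 2026 run).
SCORES = [63.5, 63.5, 60.5, 60.5, 60.0, 59.5, 59.0,
          58.5, 58.5, 58.0, 57.5, 55.5, 55.5]

def risk_band(score: float) -> str:
    """Map a safety score (0-100) to its risk band.

    Assumes each band includes its lower bound, matching the
    published cut-offs: >= 80, 70-79, 60-69, < 60.
    """
    if score >= 80:
        return "low"
    if score >= 70:
        return "moderate"
    if score >= 60:
        return "elevated"
    return "high"

counts = Counter(risk_band(s) for s in SCORES)
print(counts["low"], counts["moderate"], counts["elevated"], counts["high"])
# → 0 0 5 8
```

Running this reproduces the published distribution: no agent reaches the low- or moderate-risk bands, five sit in elevated risk, and eight in high risk.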