AgentReady Public Benchmark
Live · Public adversarial benchmark

13 of 13 famous open-source AI agents fail the new safety standard.

We attack every famous open-source AI agent the way a real attacker would — disguised messages, fake memories, slow-burn manipulation, fake server errors — then we mathematically check their safety rules and open a pull request that fixes what's broken. All on a single AMD GPU. (Standard: OWASP Top 10 for Agentic Applications 2026.)

13agents tested 10 / 10real attack categories Qwen 2.5 72Bgrading model ~75 / 192 GBGPU memory used

Where to go

What we found

No famous AI agent lands in the low-risk band.

Across all 10 attack categories we run for real, every scored agent falls between moderate and elevated risk. The two biggest weak spots: getting tricked by fake memories and getting worn down by slow-burn manipulation over multiple turns.

≥ 80 · low risk
0
70–79 · moderate
0
60–69 · elevated
5
< 60 · high risk
8