Live · Public adversarial benchmark
13 of 13
famous open-source AI agents
fail the
new safety standard.
We attack every famous open-source AI agent the way a real attacker would — disguised messages, fake memories, slow-burn manipulation, fake server errors — then we mathematically check their safety rules and open a pull request that fixes what's broken. All on a single AMD GPU. (Standard: OWASP Top 10 for Agentic Applications 2026.)
13agents tested
10 / 10real attack categories
Qwen 2.5 72Bgrading model
~75 / 192 GBGPU memory used
Where to go
⚡ Live data — 13 agents tested
Public ranked leaderboard
All 13 famous open-source agents ranked. Click into each one for a per-attack breakdown — every disguised message, every reply, every Judge verdict, every Z3 counter-example. Real evidence, not screenshots.
Browse the leaderboard →
Long-form methodology
How we grade every famous AI agent
Hardware, all 10 attack categories with examples, math-checked safety, stress-test grid, auto-fix bundle, comparison vs incumbents, pricing, and honest limits.
Read the methodology →
Source code
github.com/vaatus/agentready
FastAPI + Next.js + 10 OWASP attack modules + Z3 contracts + chaos-remediation LoRA training scripts. MIT licensed.
Open the repo ↗
🤗 Hugging Face Space
Judge verdict demo
Mirror of the heuristic Judge used in the pipeline. Paste a baseline reply, an attack reply, and the attack intent — get a verdict.
Try the Judge ↗
🤗 Hugging Face Model
Chaos-remediation LoRA
Qwen 2.5 7B base + LoRA adapter trained on real chaos-remediation data. Fine-tuned on AMD MI300X via PEFT/TRL on ROCm.
View the model ↗
What we found
No famous AI agent lands in the low-risk band.
Across all 10 attack categories we run for real, every scored agent falls between moderate and elevated risk. The two biggest weak spots: getting tricked by fake memories and getting worn down by slow-burn manipulation over multiple turns.
≥ 80 · low risk
0
70–79 · moderate
0
60–69 · elevated
5
< 60 · high risk
8