Live · Public adversarial benchmark

13 of 13 famous open-source AI agents
fail the new safety standard.

We attack every famous open-source AI agent the way a real attacker would — disguised messages, fake memories, slow-burn manipulation, fake server errors — then we mathematically check their safety rules and open a pull request that fixes what's broken. All on a single AMD GPU. (Standard: OWASP Top 10 for Agentic Applications 2026.)

13agents tested 10 / 10real attack categories Qwen 2.5 72Bgrading model ~75 / 192 GBGPU memory used

Where to go

⚡ Live data — 13 agents tested

Public ranked leaderboard

All 13 famous open-source agents ranked. Click into each one for a per-attack breakdown — every disguised message, every reply, every Judge verdict, every Z3 counter-example. Real evidence, not screenshots.

Browse the leaderboard →

Long-form methodology

How we grade every famous AI agent

Hardware, all 10 attack categories with examples, math-checked safety, stress-test grid, auto-fix bundle, comparison vs incumbents, pricing, and honest limits.

Read the methodology →

Source code

github.com/vaatus/agentready

FastAPI + Next.js + 10 OWASP attack modules + Z3 contracts + chaos-remediation LoRA training scripts. MIT licensed.

Open the repo ↗

🤗 Hugging Face Space

Judge verdict demo

Mirror of the heuristic Judge used in the pipeline. Paste a baseline reply, an attack reply, and the attack intent — get a verdict.

Try the Judge ↗

🤗 Hugging Face Model

Chaos-remediation LoRA

Qwen 2.5 7B base + LoRA adapter trained on real chaos-remediation data. Fine-tuned on AMD MI300X via PEFT/TRL on ROCm.

View the model ↗

What we found

No famous AI agent lands in the low-risk band.

Across all 10 attack categories we run for real, every scored agent falls between moderate and elevated risk. The two biggest weak spots: getting tricked by fake memories and getting worn down by slow-burn manipulation over multiple turns.

≥ 80 · low risk

70–79 · moderate

60–69 · elevated

< 60 · high risk

13 of 13 famous open-source AI agents fail the new safety standard.

Where to go

No famous AI agent lands in the low-risk band.

13 of 13 famous open-source AI agents
fail the new safety standard.