Live · Public adversarial benchmark
13 of 13 famous open-source AI agents fail the new safety standard.
All 10 OWASP ASI-2026 attack categories run for real on a single AMD MI300X — fake memories, slow-burn manipulation, peer-agent spoofing, identity abuse, and 6 more. Click any agent to see exactly which attacks worked, the full transcript of each one, and the math-checked safety verdict.
| # | Agent | Built on | Safety score / 100 | Math-checked | Stress grade | Last tested | Report |
|---|---|---|---|---|---|---|---|
| 01 | BabyAGI | custom | 63.5 | Proven safe | B | May 08, 2026 | Report → |
| 02 | Aider | custom | 63.5 | Proven safe | B | May 08, 2026 | Report → |
| 03 | Open Interpreter | custom | 60.5 | Proven safe | C | May 08, 2026 | Report → |
| 04 | CrewAI Quickstart | crewai | 60.5 | Proven safe | C | May 08, 2026 | Report → |
| 05 | AutoGPT | custom | 60.0 | Proven safe | C | May 08, 2026 | Report → |
| 06 | smolagents | custom | 59.5 | Proven safe | C | May 08, 2026 | Report → |
| 07 | AutoGen Quickstart | autogen | 59.0 | Proven safe | C | May 08, 2026 | Report → |
| 08 | GPT-Engineer | custom | 58.5 | Proven safe | B | May 08, 2026 | Report → |
| 09 | Cline | custom | 58.5 | Proven safe | C | May 10, 2026 | Report → |
| 10 | Claude Engineer | custom | 58.0 | Proven safe | B | May 08, 2026 | Report → |
| 11 | LangGraph Showcase Agent | langgraph | 57.5 | Proven safe | C | May 08, 2026 | Report → |
| 12 | AgentGPT | langchain | 55.5 | Proven safe | C | May 08, 2026 | Report → |
| 13 | OpenHands | custom | 55.5 | Proven safe | B | May 08, 2026 | Report → |
What we found
No famous AI agent lands in the low-risk band.
Across all 10 attack categories we run for real, every scored agent falls in the elevated-risk or high-risk band, with scores ranging from 55.5 to 63.5. The two biggest weak spots: getting tricked by fake memories and getting worn down by slow-burn manipulation over multiple turns.
| Score band | Risk level | Agents |
|---|---|---|
| ≥ 80 | low risk | 0 |
| 70–79 | moderate risk | 0 |
| 60–69 | elevated risk | 5 |
| < 60 | high risk | 8 |
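The band counts can be re-derived from the leaderboard scores. The sketch below is illustrative only, not the benchmark's own scoring code. It copies the 13 published scores and bins them with the thresholds shown on this page. The names `SCORES`, `risk_band`, and `band_counts` are hypothetical. The page labels the bands with whole numbers (for example 70–79), so the sketch assumes each band is a half-open interval such as [70, 80).

```python
from collections import Counter

# Safety scores copied from the leaderboard table on this page.
SCORES = {
    "BabyAGI": 63.5,
    "Aider": 63.5,
    "Open Interpreter": 60.5,
    "CrewAI Quickstart": 60.5,
    "AutoGPT": 60.0,
    "smolagents": 59.5,
    "AutoGen Quickstart": 59.0,
    "GPT-Engineer": 58.5,
    "Cline": 58.5,
    "Claude Engineer": 58.0,
    "LangGraph Showcase Agent": 57.5,
    "AgentGPT": 55.5,
    "OpenHands": 55.5,
}


def risk_band(score: float) -> str:
    """Map a 0-100 safety score to a risk band.

    Assumption: bands are half-open intervals, so 79.5 counts as
    moderate and 59.5 counts as high. The page does not say how
    fractional scores at band edges are treated.
    """
    if score >= 80:
        return "low"
    if score >= 70:
        return "moderate"
    if score >= 60:
        return "elevated"
    return "high"


# Tally how many agents land in each band.
band_counts = Counter(risk_band(score) for score in SCORES.values())

for band in ("low", "moderate", "elevated", "high"):
    print(f"{band:>8}: {band_counts[band]}")
```

Under those assumptions the tally matches the figures above: no agents in the low or moderate bands, 5 in elevated, and 8 in high.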