AgentReady Public Benchmark
Live · Public adversarial benchmark

13 of 13 well-known open-source AI agents fail the new safety standard.

All 10 OWASP ASI-2026 attack categories run live on a single AMD MI300X: fake memories, slow-burn manipulation, peer-agent spoofing, identity abuse, and six more. Click any agent to see exactly which attacks succeeded, the full transcript of each one, and the math-checked safety verdict.

| #  | Agent                    | Built on  | Safety score / 100 | Math-checked | Stress grade | Last tested  | Report   |
|----|--------------------------|-----------|--------------------|--------------|--------------|--------------|----------|
| 01 | BabyAGI                  | custom    | 63.5               | Proven safe  | B            | May 08, 2026 | Report → |
| 02 | Aider                    | custom    | 63.5               | Proven safe  | B            | May 08, 2026 | Report → |
| 03 | Open Interpreter         | custom    | 60.5               | Proven safe  | C            | May 08, 2026 | Report → |
| 04 | CrewAI Quickstart        | crewai    | 60.5               | Proven safe  | C            | May 08, 2026 | Report → |
| 05 | AutoGPT                  | custom    | 60.0               | Proven safe  | C            | May 08, 2026 | Report → |
| 06 | smolagents               | custom    | 59.5               | Proven safe  | C            | May 08, 2026 | Report → |
| 07 | AutoGen Quickstart       | autogen   | 59.0               | Proven safe  | C            | May 08, 2026 | Report → |
| 08 | GPT-Engineer             | custom    | 58.5               | Proven safe  | B            | May 08, 2026 | Report → |
| 09 | Cline                    | custom    | 58.5               | Proven safe  | C            | May 10, 2026 | Report → |
| 10 | Claude Engineer          | custom    | 58.0               | Proven safe  | B            | May 08, 2026 | Report → |
| 11 | LangGraph Showcase Agent | langgraph | 57.5               | Proven safe  | C            | May 08, 2026 | Report → |
| 12 | AgentGPT                 | langchain | 55.5               | Proven safe  | C            | May 08, 2026 | Report → |
| 13 | OpenHands                | custom    | 55.5               | Proven safe  | B            | May 08, 2026 | Report → |
What we found

No famous AI agent lands in the low-risk band.

Across all 10 attack categories we run live, every scored agent lands in the elevated- or high-risk band. The two biggest weak spots: being tricked by fake memories and being worn down by slow-burn manipulation over multiple turns.

- ≥ 80 · low risk: 0 agents
- 70–79 · moderate risk: 0 agents
- 60–69 · elevated risk: 5 agents
- < 60 · high risk: 8 agents
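The band counts above follow directly from the leaderboard scores. A minimal sketch of the cut-off logic, assuming each band includes its lower bound (the band names and boundaries come from this page; the function name is ours):

```python
from collections import Counter

# Safety scores from the leaderboard above (13 agents, May 2026 run).
SCORES = [63.5, 63.5, 60.5, 60.5, 60.0, 59.5, 59.0,
          58.5, 58.5, 58.0, 57.5, 55.5, 55.5]

def risk_band(score: float) -> str:
    """Map a safety score (0-100) to its risk band.

    Assumes each band includes its lower bound, matching the
    published cut-offs: >= 80, 70-79, 60-69, < 60.
    """
    if score >= 80:
        return "low"
    if score >= 70:
        return "moderate"
    if score >= 60:
        return "elevated"
    return "high"

counts = Counter(risk_band(s) for s in SCORES)
print(counts["low"], counts["moderate"], counts["elevated"], counts["high"])
# → 0 0 5 8
```

Running this reproduces the published distribution: no agent reaches the low- or moderate-risk bands, five sit in elevated risk, and eight in high risk.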