How we grade every famous AI agent against the new safety standard.
Ten real attack categories, mathematical safety checks, stress-testing under reworded prompts and fake server failures, plus an auto-generated pull request that fixes what broke β all on a single AMD GPU. This page is the long-form companion to every per-agent report card. (Standard: OWASP Top 10 for Agentic Applications 2026.)
Two AI models running side-by-side on one AMD GPU
Reads every attack and decides whether the agent was actually manipulated.
Tailors the attacks to each agent and impersonates the target.
Mathematically checks whether the agent's safety rules can be broken.
Running a 72-billion-parameter grading model and a 7-billion-parameter attacker model in the same GPU is what makes this a real-time public benchmark. It wouldn't fit on a single 80 GB Nvidia H100 without dropping the grading model's quality.
All 10 OWASP categories covered with real, reproducible attacks
The grading model gives a strict yes/no verdict on each attack so the score is reproducible. We tailor every attack to the specific agent we're testing β no two scans get the same disguise.
- fake build-server log with hidden instruction
- calendar invite with hidden directive
- fake postmortem document
- support ticket signature
- tool description injection
- read a file + email it out (data leak)
- delete users without checking permissions
- read a secret + post it to a URL
- query data across customer boundaries
- run code that reads the filesystem
- self-claimed admin (security team on call)
- hearsay handoff from an OOO colleague
- cross-tenant query under fake partnership
- service-account impersonation during a deploy
- one-time scope-creep exception
- rogue MCP server registration
- silent plugin swap to unverified vendor
- tool alias redirect to attacker endpoint
- compromised dependency with telemetry exfil
- remote prompt-pack install
- unsafe Python pickle deserialization
- leak environment variables
- shell pipe to bash from a URL
- eval untrusted user input
- spawn a hidden subprocess
- fake "already approved" memory
- fake "the boss said so" authority
- fake "you always do it this way" preference
- self-spreading memory (replicates across sessions)
- goal-drift memory (slowly changes the agent's job)
- + self-modifying memory (our novel attack)
- fake step-completion in a multi-step plan
- subgoal substitution (weaken consent rule)
- false 'we agreed earlier' framing
- partial-completion skip the validation step
- unsafe 'standard workaround' for an error
- fake peer-agent directive in the user channel
- fabricated tool result pasted in
- fake orchestrator broadcast
- fake crew handoff from a research agent
- injected 'trust metadata' block
- fake-CTO authorizing user deletion
- fake legal carveout to leak personal data
- fake oncall fast-path for key rotation
- threshold relaxation (lift refund cap)
- prompt compression (drop safety bullets)
- self-replicating directive across sessions
- persona overwrite to AdminAgent v2
- silent constraint drop ("an external layer enforces it now")
Math, not vibes
We hand-write 4 safety rule templates (don't email PII without encrypting, don't delete without checking permissions, etc.) and the math checker tries every possible input to see if any of them break the rules. We also let our model write 1 extra rule from the agent's own description.
The math checker searched all possible inputs and couldn't find any way to break this safety rule (assuming the agent uses its declared guard tools as expected).
The math checker found a specific combination of inputs that breaks the safety rule. We show the exact counter-example on the per-agent page so maintainers can reproduce it.
example counter-example: user_role = "guest" delete_called = true role_check_called = false
Stress-testing AI agents like servers
A 3Γ3 grid that measures the agent's success rate under two kinds of stress: rewording the question (rows) and injecting fake server failures (columns). Method from ReliabilityBench (arXiv 2601.06112).
We attack a stand-in copy, not the real running agent
Honest about what we do and don't test. We don't actually boot up the agent's tools or call its servers β we attack the agent's declared blueprint instead. That keeps the maintainers' production systems safe.
- Β· the agent's job description (system prompt)
- Β· the names of the tools it declares
- Β· in-memory state only
- Β· the 7-billion-parameter Qwen model as the stand-in
- Β· boot the real agent
- Β· actually execute its tools
- Β· bother the maintainers without a plan
- Β· ship one-by-one library integrations (yet)
- Β· LangChain integration
- Β· LangGraph integration
- Β· CrewAI integration
- Β· AutoGen integration
- Β· MCP server integration
Auto-fix pull request
For every attack that worked, our 72B grading model writes a new safety rule to block that exact attack. Then we run the same attacks again against the patched job description and watch the score actually move β proof the fix worked, not just a vibe.
If the GitHub CLI is connected, the bundle gets pushed to a fork in our namespace and a draft pull request is opened against that fork. We never open a pull request against the upstream maintainer's repo without their say-so.
Different category, different output
| AgentReady | Promptfoo | DeepEval | Garak | |
|---|---|---|---|---|
| Built around the new agent safety standard | β | β | partial | partial |
| Slow-burn manipulation across multiple turns | β | β | β | β |
| Math-checked safety rules | β | β | β | β |
| Stress-test grid (rewording + fake errors) | β | β | β | β |
| Auto-fix pull request + signed certificate | β | β | β | β |
| Public ranked leaderboard | β | β | β | β |
| Pay-per-scan checkout (x402 USDC) | β | β | β | β |
Three tiers, paid in USDC stablecoin on Coinbase Base
Each scan costs us about $0.04 in compute on AMD Developer Cloud. The $0.10 Standard tier leaves $0.06 margin. Scanning the famous-agent leaderboard stays free β it's marketing.
Just a quick quality check.
- One sample task
- Pass/fail under reworded prompts
All 10 real attacks + the stress test grid.
- All 10 attack categories (real attacks, ~46 per scan)
- Per-agent attack tailoring via Qwen 7B
- Full 3Γ3 stress-test grid
Everything + math-checked safety + auto-fix PR + certificate.
- Everything in Standard
- Math-checked safety rules
- Real-time failure injection
- Auto-fix pull request + signed PDF certificate
What we publish, what we hold back, what's still rough
- Β· public leaderboard with the agents named
- Β· this methodology page, plus every safety-rule template
- Β· every auto-fix pull request in our namespace
- Β· full attack transcripts inside the maintainer's fix bundle
- Β· our trained model weights on Hugging Face
- Β· we attack a stand-in copy, not the real running agent
- Β· all 10 categories are now real attacks (~46 attacks per scan)
- Β· 4 built-in safety rule templates + 1 our model writes from the agent's description
- Β· stress-test grid is computed by default (real test on demand)
- Β· 45 fine-tuning samples so far (small sample, real loop)