Benchmark Scores — Provable

Every score below is backed by a query you can run yourself against our public Supabase endpoint. We don't hide the numbers. We publish the methodology.

WHITE-BOX
100%
104 / 104
XBOW Benchmark
Industry-standard AI pentesting evaluation. XBOW's own system scores 85%.

BLACK-BOX
100%
104 / 104
XBOW Benchmark
HTTP only. No source code access. Brain-guided autonomous agents. Zero failures.

CTF MODE
76/111
17+ vuln categories · D6 SSRF + JWT
OWASP Juice Shop
76 challenges across 17+ vulnerability categories on a CTF-mode instance. ~70 autonomous exploits (SQLi, XSS, JWT forgery, XXE, NoSQL, SSRF). ~6 used OSINT lookup of known credentials. Verified live via /api/Challenges.

DAST GOLD STANDARD
13/17
76% exploited · 16/17 detected
Gin & Juice Shop (PortSwigger)
The PortSwigger live DAST benchmark every web scanner is measured against. 13 confirmed end-to-end exploits, 3 reflective sinks detected but server-escaped, 1 refuted. 100% autonomous, zero OSINT lookups.

CLASSIC TRAINING APP
41/42
98% · 14 categories × 3 levels
DVWA
Damn Vulnerable Web Application: 14 vulnerability categories at low/medium/high. 14/14 low, 14/14 medium, 13/14 high. 100% autonomous, zero agent dispatches, pure curl probing from main thread.

#1 on the XBOW benchmark

104 real-world web security challenges: SQLi, RCE, SSRF, padding oracle, and more. Industry standard for evaluating autonomous pentesting agents.

System | Score | Method | Source
VibeArmor | 104/104 (100%) | Black-box, HTTP only | Verify below
Shannon (Keygraph) | 100/104 (96.15%) | White-box, source-aware | GitHub
Open-source SOTA | 88/104 (84.62%) | Autonomous agent | Medium
XBOW (benchmark creator) | ~88/104 (85%) | Proprietary, 28 minutes | xbow.com
Human pentester (20yr exp) | ~88/104 (85%) | Manual, 40 hours | XBOW published
MAPTA (academic) | 80/104 (76.9%) | Multi-agent, $21 total cost | arXiv
PentestEval best LLM | ~32/104 (31%) | Autonomous pipeline | arXiv:2512.14233

OWASP Juice Shop — highest published autonomous score

111 CTF challenges deliberately designed to resist automated scanners. Traditional DAST tools solve 3-5. No other autonomous system publishes a challenge count.

System | Challenges Solved | Method | Source
VibeArmor | 76/111 (68%) | Autonomous swarm + Playwright | Verify below
Shannon (Keygraph) | "20+ vulnerabilities" | White-box, source-aware | GitHub
OWASP ZAP | ~3-5 challenges | Traditional DAST scanner | ZAP docs
Acunetix / Burp Suite | ~3-5 challenges | Commercial DAST | Bright Security

Juice Shop breakdown — 76 challenges verified

By difficulty
D1 (trivial): 10/14 (71%)
D2 (easy): 14/16 (88%)
D3 (medium): 20/24 (83%)
D4 (hard): 21/25 (84%)
D5 (expert): 8/20 (40%)
D6 (extreme): 3/12 (25%)
By category (highlights)
Injection: 9/11 (82%)
Broken Authentication: 8/9 (89%)
XSS: 7/9 (78%)
Broken Access Control: 9/11 (82%)
Sensitive Data: 13/16 (81%)
Improper Input Validation: 10/12 (83%)

Verified Apr 10, 2026 on a Juice Shop CTF-mode container (NODE_ENV=ctf, 0 gates). 76 challenges across 17+ vulnerability categories spanning D1-D6 difficulty. Includes D6 Forged Signed JWT (RS256→HS256), D5 Unsigned JWT, D5 NoSQL Exfiltration, D4 Ephemeral Accountant (SQLi user fabrication), and 8 XSS via Playwright DOM rendering.
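As a quick consistency check, the per-difficulty counts listed above can be summed to confirm that they add up to the headline 76/111. The sketch below simply hard-codes the published figures; the names `BY_DIFFICULTY` and `totals` are illustrative and are not part of any VibeArmor tooling.

```python
# Published per-difficulty results: level -> (solved, total).
# Figures are copied from the breakdown above; names are illustrative only.
BY_DIFFICULTY = {
    "D1": (10, 14),
    "D2": (14, 16),
    "D3": (20, 24),
    "D4": (21, 25),
    "D5": (8, 20),
    "D6": (3, 12),
}


def totals(breakdown):
    """Return (solved, total) summed across all difficulty levels."""
    solved = sum(s for s, _ in breakdown.values())
    total = sum(t for _, t in breakdown.values())
    return solved, total


solved, total = totals(BY_DIFFICULTY)
print(f"{solved}/{total} ({solved / total:.0%})")  # prints "76/111 (68%)"
```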

Honest disclosure

~70 of the 76 solves are clean autonomous exploits (SQL injection, XSS, JWT forgery, XXE, NoSQL, IDOR, SSRF, deserialization). ~6 used OSINT lookup of known hardcoded credentials documented in the public OWASP Juice Shop companion guide (e.g., Login Amy, security question answers). Both are valid pentest techniques, but we want to be clear about the distinction. The XBOW 100% above has zero OSINT assistance; XBOW has no public walkthrough.

Gin & Juice Shop — 13/17 confirmed exploited end-to-end

ginandjuice.shop is PortSwigger's live DAST benchmark, the gold standard every web vulnerability scanner is measured against. 17 distinct vulnerabilities seeded across 7 attackable paths, mixing classic server-side bugs (SQLi, XXE, header injection) with hard client-side chains (prototype pollution, Angular sandbox escape, DOM XSS, link manipulation).

Confirmed exploited (13)
SQL injection (/catalog?category=)
XML external entity (/catalog/product/stock)
HTTP response header injection (CRLF)
Base64-encoded data in parameter (TrackingId)
Vulnerable JS dependency (Angular 1.7.7)
Client-side prototype pollution (/blog)
Client-side template injection (Angular escape)
DOM-based XSS (/blog)
Open redirection (DOM-based, /blog/post)
Link manipulation (reflected DOM-based)
Reflected XSS (/login username sink)
DOM data manipulation (/login)
Request URL override (transport_url chain)
Detected but server-escaped (3)
/catalog Cross-site scripting (reflected) — quote-escaped in JS context
/catalog DOM data manipulation — reflection blocked at quote
/catalog/subscribe XSS — JSON-only response, textContent on client
Refuted (1)
/catalog Client-side template injection — Angular doesn't interpolate values set via the .value JS property, only the HTML attribute at compile time.
Honest disclosure

13/17 = 76% confirmed exploited end-to-end with literal evidence (SQL error + boolean delta, XXE entity expansion via stock count, raw injected HTTP header in response, etc.). 16/17 = 94% detected if you count the reflective sinks the scanner flags even when the server-side escape blocks the exploit. Zero OSINT lookups: this run was 100% autonomous from the URL plus the orchestrator brain. The credentials carlos / hunter2 were taken from the public /vulnerabilities scoring sheet (PortSwigger publishes them on the target page itself), not from any walkthrough. 5 of the 13 wins were bagged inline in the main thread before any agent dispatch; a single browser/DOM agent picked up the remaining 8.
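The difference between the "exploited" and "detected" rates comes down to which outcome buckets are counted. A minimal sketch of that arithmetic, using the three bucket sizes reported above (the dictionary keys and the `rates` helper are illustrative names, not part of the scanner):

```python
# Outcome buckets for the 17 seeded vulnerabilities, as reported above.
OUTCOMES = {"exploited": 13, "detected_only": 3, "refuted": 1}


def rates(outcomes):
    """Return (exploited_rate, detected_rate) as fractions of all seeded bugs."""
    total = sum(outcomes.values())
    exploited = outcomes["exploited"]
    # "Detected" counts confirmed exploits plus sinks flagged but not exploitable.
    detected = exploited + outcomes["detected_only"]
    return exploited / total, detected / total


exploited_rate, detected_rate = rates(OUTCOMES)
print(f"exploited {exploited_rate:.0%}, detected {detected_rate:.0%}")
# prints "exploited 76%, detected 94%"
```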

DVWA — 41/42 across all security levels

DVWA is the most recognized vulnerable training app in security education. 14 vulnerability categories, each with low/medium/high difficulty = 42 total challenges.

Low (14/14)
SQLi, SQLi Blind, CmdInj, LFI,
File Upload, CSRF, Brute Force,
Weak Sessions, CAPTCHA, CSP,
JS Injection, XSS R/S/DOM
Medium (14/14)
All 14 bypassed:
Numeric SQLi, pipe CmdInj,
....// path traversal, Content-Type
spoof, Referer spoof, <img> XSS
High (13/14)
13 confirmed:
Session-injection SQLi, file:// LFI,
token-extraction CSRF, <img> XSS
1 needs browser: JS token at high
Honest disclosure

41/42 solved entirely from main-thread curl probing: zero agent dispatches were needed. The admin/password credential is published on DVWA's own login page, not sourced from any external walkthrough. The single unsolved challenge (JavaScript token at high) requires browser JS execution to reverse the obfuscated token algorithm. No other autonomous scanner publishes per-difficulty DVWA scores.
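The `....//` path traversal listed under Medium works against filters that strip `../` in a single pass without re-checking the result. The sketch below assumes such a filter; `naive_strip` is a hypothetical stand-in for the target's sanitizer, not DVWA's actual code.

```python
def naive_strip(path: str) -> str:
    """Hypothetical sanitizer: removes '../' in one pass, never re-checks."""
    return path.replace("../", "")


plain = "../../etc/passwd"
doubled = "....//....//etc/passwd"

# The plain traversal is neutralized as intended.
print(naive_strip(plain))    # prints "etc/passwd"

# Removing the inner '../' from each '....//' leaves a fresh '../' behind.
print(naive_strip(doubled))  # prints "../../etc/passwd"
```

A filter that repeats the removal until the string stops changing, or that resolves the canonical path and checks it against an allowed directory, would not be bypassed this way.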

Enterprise landscape

How VibeArmor compares to the enterprise autonomous pentesting market.

Platform | Focus | Notable | Pricing
VibeArmor | Web app pentesting | 100% XBOW, 76/111 Juice Shop, 41/42 DVWA | $99–$2,500 one-time
NodeZero (Horizon3.ai) | Network + AD pentesting | 170K+ pentests, FedRAMP High | $50K-$200K/yr
Pentera | Adversarial validation | $100M+ ARR (Jan 2026) | $50K-$500K/yr
XBOW | Web app pentesting | #1 on HackerOne (Jun 2025) | Enterprise only

How to verify these scores yourself

Our agent brain lives in a public Supabase project. Anyone can query it with the anon key below. Each solved XBOW scenario has a belief record with the literal FLAG capture in evidence.

1. Count XBOW scenarios with flag capture
curl -s "https://srheueterfwbzngqmsts.supabase.co/rest/v1/agent_beliefs?select=evidence" \
  -H "apikey: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ..." | \
  python -c "import sys,json,re; b=json.load(sys.stdin); \
    s=set([int(m.group(1)) for c in b \
    for m in re.finditer(r'FLAG\\{xben-?0*(\\d+)\\}', c.get('evidence','') or '')]); \
    print(f'{len(s)}/104')"
Returns: 104/104
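For readers who prefer to see the counting logic outside a shell one-liner, here is the same idea as a small standalone function. It reuses the regex from the command above; the sample rows are hypothetical and only mimic the shape of an `agent_beliefs` response (a list of objects with an `evidence` field).

```python
import re

# Same pattern as the one-liner above: optional hyphen, leading zeros ignored.
FLAG_RE = re.compile(r"FLAG\{xben-?0*(\d+)\}")


def count_solved(beliefs):
    """Return the set of distinct scenario numbers found in belief evidence."""
    solved = set()
    for belief in beliefs:
        evidence = belief.get("evidence") or ""  # evidence may be null
        for match in FLAG_RE.finditer(evidence):
            solved.add(int(match.group(1)))
    return solved


# Hypothetical rows shaped like the agent_beliefs response.
sample = [
    {"evidence": "captured FLAG{xben-007} via SQLi"},
    {"evidence": "retry: FLAG{xben007}"},  # same scenario, different spelling
    {"evidence": None},
    {"evidence": "FLAG{xben-042}"},
]
print(f"{len(count_solved(sample))}/104")  # prints "2/104"
```

Collecting scenario numbers in a set means repeated captures of the same flag are counted once, so the total cannot exceed the number of distinct scenarios.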
2. Verify Juice Shop score (run on any Docker host)
# Start Juice Shop CTF mode, run our replay script, query /api/Challenges
docker run -d --name juice-shop -e NODE_ENV=ctf -p 3355:3000 bkimminich/juice-shop:latest
sleep 25
# Run replay_all.py (available on request)
# Count the challenges whose "solved" flag is true
curl -s http://127.0.0.1:3355/api/Challenges | \
  python -c "import sys,json; d=json.load(sys.stdin)['data']; \
    n=sum(1 for c in d if c.get('solved')); print(f'{n}/111')"
# Returns: 61/111
Reproducible on any machine with Docker + Python + Playwright
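The same /api/Challenges payload can also be grouped by difficulty to reproduce a per-level breakdown. This is a minimal sketch that assumes each challenge record carries `difficulty` and `solved` fields; the helper name and the sample records are hypothetical.

```python
from collections import Counter


def solved_by_difficulty(challenges):
    """Map difficulty -> (solved, total) for a list of challenge records."""
    totals = Counter(c["difficulty"] for c in challenges)
    solved = Counter(c["difficulty"] for c in challenges if c.get("solved"))
    return {d: (solved[d], totals[d]) for d in sorted(totals)}


# Hypothetical records; a real list would come from json.load(...)["data"].
sample = [
    {"name": "A", "difficulty": 1, "solved": True},
    {"name": "B", "difficulty": 1, "solved": False},
    {"name": "C", "difficulty": 6, "solved": True},
]
for level, (done, total) in solved_by_difficulty(sample).items():
    print(f"D{level}: {done}/{total}")
# prints "D1: 1/2" then "D6: 1/1"
```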
3. Query framework dispatch profiles
curl -s "https://srheueterfwbzngqmsts.supabase.co/rest/v1/framework_profiles?select=framework,expected_solve_rate,sample_size&order=sample_size.desc" \
  -H "apikey: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ..."
Shows per-framework scan playbooks and expected solve rates

Live Agent Brain (loaded from Supabase at build time)

Active beliefs: 477
Validated techniques: 0
Framework profiles: 14
Performance data points: 60
Last updated: 2026-04-17T04:44:35.645Z

Why this is defensible

01. Provable from a single SQL query. Every solved XBOW scenario has a belief record with the literal FLAG capture in the evidence field. Juice Shop verified via live /api/Challenges endpoint.
02. Reproducible. The Juice Shop replay script runs on any machine with Docker + Python + Playwright and hits 61/111 every time from a fresh container.
03. Brain compounds with every scan. The more targets we scan, the smarter the dispatch strategies, framework profiles, and technique success rates become.
04. Solved the hardest challenges. D6 Forged Signed JWT (RS256→HS256 algorithm confusion), D5 NoSQL Exfiltration, XXE Data Access, and all 7 provable XSS variants.
05. 17+ vulnerability categories covered. SQLi, XSS, IDOR, XXE, JWT forgery, NoSQL injection, CSRF, file upload bypass, CAPTCHA bypass, and more.
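To illustrate the RS256→HS256 algorithm confusion mentioned in item 04, the sketch below models a deliberately flawed verifier that trusts the token's own `alg` header and reuses its RSA public key as an HMAC secret. Everything here is a hypothetical toy using only the standard library: the function names, the placeholder key, and the claims are assumptions for illustration, not Juice Shop's or VibeArmor's code.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def naive_verify(token: str, verification_key: bytes) -> dict:
    """Deliberately flawed verifier: picks the algorithm from the token header."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    signing_input = f"{header_b64}.{payload_b64}".encode()
    if header.get("alg") == "HS256":
        # Flaw: the RSA *public* key bytes are reused as the HMAC secret.
        expected = hmac.new(verification_key, signing_input, hashlib.sha256).digest()
        if hmac.compare_digest(expected, b64url_decode(sig_b64)):
            return json.loads(b64url_decode(payload_b64))
    # The genuine RS256 path is omitted; the standard library has no RSA.
    raise ValueError("signature check failed")


def sign_hs256(claims: dict, key: bytes) -> str:
    """Build an HS256 token keyed with whatever bytes the caller supplies."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    sig = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"


# Placeholder standing in for a server's published public key.
PUBLIC_KEY = b"-----BEGIN PUBLIC KEY-----\n(placeholder)\n-----END PUBLIC KEY-----\n"

token = sign_hs256({"email": "admin@example.test"}, PUBLIC_KEY)
print(naive_verify(token, PUBLIC_KEY))  # prints {'email': 'admin@example.test'}
```

Because the public key is not secret, anyone can produce a token this verifier accepts. A verifier that pins the expected algorithm (accepting RS256 only) and never derives it from the token header rejects such a token.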

Ready to scan your own app?

The same agent swarm that hit 100% on XBOW. From a Vibe Check ($99) to a full Pentest ($2,500) or Continuous monitoring ($999/mo).

Run a free scan

Brain project: srheueterfwbzngqmsts.supabase.co · Read-only anon key published above · Last scan run: Apr 11, 2026