Agent benchmarks
20 tasks.
Published metrics.
Reference runs use specialist routing (Flash- or Haiku-class models on explore and verify, Sonnet on build) with truthpack and ShipGate. The baseline models a single Opus-class agent per task.
| Metric | Value | Context |
|---|---|---|
| Median credits / task | 6 | vs ~15 baseline |
| Estimated savings | 60% | credits equivalent |
| ShipGate pass rate | 95% | after agent apply |
| Efficient routing | 100% | non–Max Mode runs |
Measured 2026-05-29 · catalog v2026.05
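
The estimated savings figure is consistent with the two published medians: 6 credits per task against a baseline of roughly 15. The sketch below shows that arithmetic. The helper name `savingsPercent` is hypothetical and is not taken from the benchmark tooling; rounding to a whole percent is also an assumption about how the figure is reported.

```typescript
// Hypothetical helper for illustration only; not part of the benchmark tooling.
// Returns the share of credits saved relative to a baseline, as a whole percent.
function savingsPercent(medianCredits: number, baselineCredits: number): number {
  if (baselineCredits <= 0) {
    throw new RangeError("baselineCredits must be positive");
  }
  const saved = baselineCredits - medianCredits;
  // Assumption: the published figure is rounded to the nearest whole percent.
  return Math.round((saved / baselineCredits) * 100);
}

// Published medians: 6 credits per task vs a baseline of about 15.
console.log(savingsPercent(6, 15)); // 60
```

Because the baseline is stated as approximate (~15), the 60% figure should be read as an estimate as well.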
- Measured via pnpm agent:benchmark (deterministic credit model from AGENT_COSTS).
- Wholestack routes Explorer/Verifier to Flash-class models by default.
- Max Mode (Pro+) opts into Opus for ISL Operator only when explicitly enabled.
- Baseline assumes one frontier-class pass per task without specialist decomposition.
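
The routing rules in these notes can be sketched as a small lookup. This is a minimal illustration under stated assumptions, not the product's routing code: the type names, the `routeModel` function, and the tier labels are all hypothetical. The notes do not say which tier ISL Operator uses when Max Mode is off, so the `"sonnet"` fallback below is a placeholder assumption.

```typescript
// Hypothetical names throughout; only the routing rules come from the notes.
type ModelTier = "flash-class" | "sonnet" | "opus";
type Specialist = "explorer" | "verifier" | "builder" | "isl-operator";

interface RoutingOptions {
  // True only when the user has explicitly enabled Max Mode.
  maxMode: boolean;
}

function routeModel(
  specialist: Specialist,
  options: RoutingOptions = { maxMode: false }
): ModelTier {
  switch (specialist) {
    case "explorer":
    case "verifier":
      // Explorer and Verifier go to Flash-class models by default.
      return "flash-class";
    case "builder":
      // Build tasks use Sonnet in the reference runs.
      return "sonnet";
    case "isl-operator":
      // Max Mode opts into Opus for ISL Operator only.
      // ASSUMPTION: the non-Max default is not stated; "sonnet" is a placeholder.
      return options.maxMode ? "opus" : "sonnet";
  }
}

console.log(routeModel("explorer"));                          // flash-class
console.log(routeModel("isl-operator", { maxMode: true }));   // opus
```

In this sketch, enabling Max Mode leaves every specialist other than ISL Operator on its default tier, which matches the "only when explicitly enabled" wording.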
Task catalog
| ID | Task | Category | Specialist |
|---|---|---|---|
| t01 | Find auth middleware | explore | explorer |
| t02 | List API routes | explore | explorer |
| t03 | Explain billing webhook | explore | explorer |
| t04 | Debug 404 on loop | fix | nexus:debug |
| t05 | Fix hydration mismatch | fix | nexus:debug |
| t06 | Repair failed unit test | fix | nexus:debug |
| t07 | Add Zod validation | build | builder |
| t08 | Scaffold App Router page | build | builder |
| t09 | Wire Prisma model | build | builder |
| t10 | Implement server action | build | builder |
| t11 | Run VibeCheck scan | verify | verifier |
| t12 | Truthpack drift check | verify | verifier |
| t13 | ShipGate pre-apply | verify | shipgate |
| t14 | Browser proof login | verify | browser |
| t15 | Extract shared util | refactor | nexus:refactor |
| t16 | Rename env helper | refactor | nexus:refactor |
| t17 | Split god function | refactor | nexus:refactor |
| t18 | Security review API | review | nexus:review |
| t19 | PR audit | review | nexus:review |
| t20 | Full audit workflow | verify | workflow:full-audit |
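
For readers who want to work with the catalog programmatically, it can be represented as typed data. The sketch below is illustrative: the IDs, categories, and specialists are copied from the table, while the `CatalogTask` shape, the `TASKS` constant, and `countByCategory` are hypothetical names that do not come from the benchmark source.

```typescript
// Hypothetical data shape; the values mirror the catalog table.
type Category = "explore" | "fix" | "build" | "verify" | "refactor" | "review";

interface CatalogTask {
  id: string;
  category: Category;
  specialist: string;
}

const TASKS: CatalogTask[] = [
  { id: "t01", category: "explore", specialist: "explorer" },
  { id: "t02", category: "explore", specialist: "explorer" },
  { id: "t03", category: "explore", specialist: "explorer" },
  { id: "t04", category: "fix", specialist: "nexus:debug" },
  { id: "t05", category: "fix", specialist: "nexus:debug" },
  { id: "t06", category: "fix", specialist: "nexus:debug" },
  { id: "t07", category: "build", specialist: "builder" },
  { id: "t08", category: "build", specialist: "builder" },
  { id: "t09", category: "build", specialist: "builder" },
  { id: "t10", category: "build", specialist: "builder" },
  { id: "t11", category: "verify", specialist: "verifier" },
  { id: "t12", category: "verify", specialist: "verifier" },
  { id: "t13", category: "verify", specialist: "shipgate" },
  { id: "t14", category: "verify", specialist: "browser" },
  { id: "t15", category: "refactor", specialist: "nexus:refactor" },
  { id: "t16", category: "refactor", specialist: "nexus:refactor" },
  { id: "t17", category: "refactor", specialist: "nexus:refactor" },
  { id: "t18", category: "review", specialist: "nexus:review" },
  { id: "t19", category: "review", specialist: "nexus:review" },
  { id: "t20", category: "verify", specialist: "workflow:full-audit" },
];

// Tally how many catalog tasks fall into each category.
function countByCategory(tasks: CatalogTask[]): Record<Category, number> {
  const counts: Record<Category, number> = {
    explore: 0, fix: 0, build: 0, verify: 0, refactor: 0, review: 0,
  };
  for (const task of tasks) {
    counts[task.category] += 1;
  }
  return counts;
}

console.log(countByCategory(TASKS));
```

The tally gives 3 explore, 3 fix, 4 build, 5 verify, 3 refactor, and 2 review tasks, which sums to the 20 tasks in the catalog.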
