Wholestack
Agent benchmarks

20 tasks.
Published metrics.

Reference runs use specialist routing (Flash/Haiku on explore & verify, Sonnet on build) with truthpack + ShipGate. The baseline assumes a single Opus-class agent per task.
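The routing described above can be pictured as a small lookup table. This is only a sketch: the tier labels, type names, and function names are hypothetical, and only the three category mappings stated in the text are filled in. How fix, refactor, and review tasks are routed is not specified here, so they are left unmapped.

```typescript
// Illustrative tier labels (assumptions, not Wholestack identifiers).
type ModelTier = "flash-haiku" | "sonnet" | "opus";

// Categories as they appear in the task catalog.
type Category = "explore" | "fix" | "build" | "verify" | "refactor" | "review";

// Only the mappings stated in the text: Flash/Haiku on explore & verify,
// Sonnet on build. Other categories are intentionally left out.
const REFERENCE_ROUTING: Partial<Record<Category, ModelTier>> = {
  explore: "flash-haiku",
  verify: "flash-haiku",
  build: "sonnet",
};

// Reference run: look up the tier, or undefined when the text does not say.
function referenceTier(category: Category): ModelTier | undefined {
  return REFERENCE_ROUTING[category];
}

// Baseline: a single Opus-class agent for every task, whatever its category.
function baselineTier(_category: Category): ModelTier {
  return "opus";
}
```

Returning `undefined` for unmapped categories keeps the sketch from guessing at routing the page does not document.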

Median credits / task: 6 (vs ~15 baseline)

Estimated savings: 60% (credits equivalent)

ShipGate pass rate: 95% (after agent apply)

Efficient routing: 100% (non–Max Mode runs)

Measured 2026-05-29 · catalog v2026.05

  • Measured via pnpm agent:benchmark (deterministic credit model from AGENT_COSTS).
  • Wholestack routes Explorer/Verifier to Flash-class models by default.
  • Max Mode (Pro+) opts into Opus for ISL Operator only when explicitly enabled.
  • Baseline assumes one frontier-class pass per task without specialist decomposition.
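The 60% savings figure is consistent with the two published medians (6 credits vs ~15 baseline). A minimal sketch of that arithmetic follows; the function name is hypothetical and this is not the `AGENT_COSTS` credit model itself, whose contents are not shown on this page.

```typescript
// Percentage saved relative to a baseline cost.
// Multiplying before dividing keeps the 6-vs-15 case exact in floating point.
function savingsPercent(baselineCredits: number, actualCredits: number): number {
  if (baselineCredits <= 0) {
    throw new RangeError("baselineCredits must be positive");
  }
  return ((baselineCredits - actualCredits) * 100) / baselineCredits;
}

// Published medians: 6 credits per task against a baseline of roughly 15.
const estimatedSavings = savingsPercent(15, 6); // 60
```

Because the baseline is quoted as approximately 15, the resulting percentage is an estimate as well.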

Task catalog

ID   Task                      Category  Specialist
t01  Find auth middleware      explore   explorer
t02  List API routes           explore   explorer
t03  Explain billing webhook   explore   explorer
t04  Debug 404 on loop         fix       nexus:debug
t05  Fix hydration mismatch    fix       nexus:debug
t06  Repair failed unit test   fix       nexus:debug
t07  Add Zod validation        build     builder
t08  Scaffold App Router page  build     builder
t09  Wire Prisma model         build     builder
t10  Implement server action   build     builder
t11  Run VibeCheck scan        verify    verifier
t12  Truthpack drift check     verify    verifier
t13  ShipGate pre-apply        verify    shipgate
t14  Browser proof login       verify    browser
t15  Extract shared util       refactor  nexus:refactor
t16  Rename env helper         refactor  nexus:refactor
t17  Split god function        refactor  nexus:refactor
t18  Security review API       review    nexus:review
t19  PR audit                  review    nexus:review
t20  Full audit workflow       verify    workflow:full-audit
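For readers who want to work with the catalog programmatically, the rows above can be written as typed data and tallied by category or specialist. This is a sketch only: the `BenchmarkTask` shape, `TASK_CATALOG`, and `countBy` are hypothetical names, not the format used by `pnpm agent:benchmark`.

```typescript
type Category = "explore" | "fix" | "build" | "verify" | "refactor" | "review";

// Hypothetical record shape mirroring the four table columns.
interface BenchmarkTask {
  id: string;
  task: string;
  category: Category;
  specialist: string;
}

const TASK_CATALOG: BenchmarkTask[] = [
  { id: "t01", task: "Find auth middleware", category: "explore", specialist: "explorer" },
  { id: "t02", task: "List API routes", category: "explore", specialist: "explorer" },
  { id: "t03", task: "Explain billing webhook", category: "explore", specialist: "explorer" },
  { id: "t04", task: "Debug 404 on loop", category: "fix", specialist: "nexus:debug" },
  { id: "t05", task: "Fix hydration mismatch", category: "fix", specialist: "nexus:debug" },
  { id: "t06", task: "Repair failed unit test", category: "fix", specialist: "nexus:debug" },
  { id: "t07", task: "Add Zod validation", category: "build", specialist: "builder" },
  { id: "t08", task: "Scaffold App Router page", category: "build", specialist: "builder" },
  { id: "t09", task: "Wire Prisma model", category: "build", specialist: "builder" },
  { id: "t10", task: "Implement server action", category: "build", specialist: "builder" },
  { id: "t11", task: "Run VibeCheck scan", category: "verify", specialist: "verifier" },
  { id: "t12", task: "Truthpack drift check", category: "verify", specialist: "verifier" },
  { id: "t13", task: "ShipGate pre-apply", category: "verify", specialist: "shipgate" },
  { id: "t14", task: "Browser proof login", category: "verify", specialist: "browser" },
  { id: "t15", task: "Extract shared util", category: "refactor", specialist: "nexus:refactor" },
  { id: "t16", task: "Rename env helper", category: "refactor", specialist: "nexus:refactor" },
  { id: "t17", task: "Split god function", category: "refactor", specialist: "nexus:refactor" },
  { id: "t18", task: "Security review API", category: "review", specialist: "nexus:review" },
  { id: "t19", task: "PR audit", category: "review", specialist: "nexus:review" },
  { id: "t20", task: "Full audit workflow", category: "verify", specialist: "workflow:full-audit" },
];

// Count tasks grouped by one of the two classification columns.
function countBy(
  tasks: BenchmarkTask[],
  key: "category" | "specialist",
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const t of tasks) {
    const k = t[key];
    counts[k] = (counts[k] ?? 0) + 1;
  }
  return counts;
}

const byCategory = countBy(TASK_CATALOG, "category");
// explore: 3, fix: 3, build: 4, verify: 5, refactor: 3, review: 2
```

The tally shows that verify is the largest category, with five of the 20 tasks, and that its tasks are spread across four different specialists.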