Agent benchmarks

20 tasks.
Published metrics.

Reference runs use specialist routing (Flash/Haiku for explore and verify, Sonnet for build) with truthpack and ShipGate. The baseline models a single Opus-class agent per task.
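
The routing described above can be pictured as a small lookup from task category to model tier. This is a minimal sketch, not Wholestack's actual router: the type names, tier labels, and function names are hypothetical, and only the three categories the text mentions (explore, verify, build) are covered, because routing for other categories is not stated here.

```typescript
// Hypothetical tier labels; the page names Flash/Haiku, Sonnet, and Opus-class only.
type ModelTier = "flash-or-haiku" | "sonnet" | "opus-class";

// Only the categories whose routing is stated for the reference runs.
type RoutedCategory = "explore" | "verify" | "build";

// Reference runs: cheaper tiers for explore and verify, Sonnet for build.
const REFERENCE_ROUTING: Record<RoutedCategory, ModelTier> = {
  explore: "flash-or-haiku",
  verify: "flash-or-haiku",
  build: "sonnet",
};

// Specialist routing: pick the tier from the task category.
function routeReference(category: RoutedCategory): ModelTier {
  return REFERENCE_ROUTING[category];
}

// Baseline: one Opus-class agent handles every task, whatever its category.
function routeBaseline(_category: RoutedCategory): ModelTier {
  return "opus-class";
}
```

Under these assumptions, `routeReference("build")` yields the Sonnet tier, while `routeBaseline` returns the Opus-class tier for any category.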

Median credits / task: vs ~15 baseline
Estimated savings: 60% (credits equivalent)
ShipGate pass rate: 95% (after agent apply)
Efficient routing: 100% (non–Max Mode runs)

Measured 2026-05-29 · catalog v2026.05

Measured via pnpm agent:benchmark (deterministic credit model from AGENT_COSTS).
Wholestack routes Explorer/Verifier to Flash-class models by default.
Max Mode (Pro+) opts into Opus for ISL Operator only when explicitly enabled.
Baseline assumes one frontier-class pass per task without specialist decomposition.
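
As a rough illustration of how a deterministic credit model leads to a savings figure, the sketch below takes a median over per-task credits and compares it to a baseline. The shape of AGENT_COSTS is not shown on this page, so the function names and the sample numbers are placeholders chosen for the arithmetic only. They are not measured results.

```typescript
// Median of a non-empty list of per-task credit costs.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Savings as a fraction of the baseline: 1 - (routed median / baseline).
function savingsFraction(medianCredits: number, baselineCredits: number): number {
  if (baselineCredits <= 0) throw new Error("baseline must be positive");
  return 1 - medianCredits / baselineCredits;
}

// Placeholder inputs (hypothetical, NOT benchmark data).
const placeholderCredits = [2, 3, 10];
const placeholderBaseline = 12;

const m = median(placeholderCredits);                  // 3
const saved = savingsFraction(m, placeholderBaseline); // 0.75
console.log(`median=${m}, savings=${(saved * 100).toFixed(0)}%`);
```

Because the inputs are fixed costs and not live token counts, repeated runs of such a model give the same result, which is what "deterministic" suggests here.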

Task catalog

ID	Task	Category	Specialist
t01	Find auth middleware	explore	explorer
t02	List API routes	explore	explorer
t03	Explain billing webhook	explore	explorer
t04	Debug 404 on loop	fix	nexus:debug
t05	Fix hydration mismatch	fix	nexus:debug
t06	Repair failed unit test	fix	nexus:debug
t07	Add Zod validation	build	builder
t08	Scaffold App Router page	build	builder
t09	Wire Prisma model	build	builder
t10	Implement server action	build	builder
t11	Run VibeCheck scan	verify	verifier
t12	Truthpack drift check	verify	verifier
t13	ShipGate pre-apply	verify	shipgate
t14	Browser proof login	verify	browser
t15	Extract shared util	refactor	nexus:refactor
t16	Rename env helper	refactor	nexus:refactor
t17	Split god function	refactor	nexus:refactor
t18	Security review API	review	nexus:review
t19	PR audit	review	nexus:review
t20	Full audit workflow	verify	workflow:full-audit
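
For readers who want to work with the catalog programmatically, the table can be parsed as tab-separated rows. This is a sketch under assumptions: the `Task` type and helper names are hypothetical and not part of any published API. The rows are copied from the table above.

```typescript
type Task = { id: string; task: string; category: string; specialist: string };

// Rows copied from the catalog table; "\t" marks the column separator.
const CATALOG_TSV = `t01\tFind auth middleware\texplore\texplorer
t02\tList API routes\texplore\texplorer
t03\tExplain billing webhook\texplore\texplorer
t04\tDebug 404 on loop\tfix\tnexus:debug
t05\tFix hydration mismatch\tfix\tnexus:debug
t06\tRepair failed unit test\tfix\tnexus:debug
t07\tAdd Zod validation\tbuild\tbuilder
t08\tScaffold App Router page\tbuild\tbuilder
t09\tWire Prisma model\tbuild\tbuilder
t10\tImplement server action\tbuild\tbuilder
t11\tRun VibeCheck scan\tverify\tverifier
t12\tTruthpack drift check\tverify\tverifier
t13\tShipGate pre-apply\tverify\tshipgate
t14\tBrowser proof login\tverify\tbrowser
t15\tExtract shared util\trefactor\tnexus:refactor
t16\tRename env helper\trefactor\tnexus:refactor
t17\tSplit god function\trefactor\tnexus:refactor
t18\tSecurity review API\treview\tnexus:review
t19\tPR audit\treview\tnexus:review
t20\tFull audit workflow\tverify\tworkflow:full-audit`;

// Split each row into its four columns.
function parseCatalog(tsv: string): Task[] {
  return tsv.split("\n").map((line) => {
    const [id, task, category, specialist] = line.split("\t");
    return { id, task, category, specialist };
  });
}

// Count how many tasks fall into each category.
function countByCategory(tasks: Task[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const t of tasks) counts[t.category] = (counts[t.category] ?? 0) + 1;
  return counts;
}

const tasks = parseCatalog(CATALOG_TSV);
const counts = countByCategory(tasks);
console.log(tasks.length, counts);
```

Parsed this way, the catalog holds 20 tasks: 5 verify, 4 build, 3 each for explore, fix, and refactor, and 2 review.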
