boring enough to trust

2026-06-30

most agent demos are exciting. most agents in production are not. that’s the goal.

a harness is the scaffolding around a model: the eval tasks, the sandbox, the verification, the retries, the place where you decide what “done” means. it’s unglamorous. it’s also the only reason you can trust a fleet of agents to run without you watching.

the job

write tasks an agent can be scored against, not vibes.
make success criteria verifiable, so the agent can loop on its own.
assume reward hacking. design against it.
keep the harness smaller than the thing it tests.

more soon.