boring enough to trust
most agent demos are exciting. most agents in production are not. that’s the goal.
a harness is the scaffolding around a model: the eval tasks, the sandbox, the verification, the retries, the place where you decide what “done” means. it’s unglamorous. it’s also the only reason you can trust a fleet of agents to run without you watching.
the job
- write tasks an agent can be scored against, not judged on vibes.
- make success criteria machine-verifiable, so the agent can loop on its own without a human checking each pass.
- assume reward hacking. design against it.
- keep the harness smaller than the thing it tests.
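a minimal sketch of what that list can look like in code, assuming python. every name here (`Task`, `Result`, `run_task`, `toy_agent`) is hypothetical and made up for illustration, not taken from any real library. the one idea it tries to show: the harness runs the check itself and never takes the agent's word that it is done.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    """A scoreable task. All names in this sketch are hypothetical."""
    name: str
    prompt: str
    # Verifier owned by the harness. It inspects the output only,
    # so an agent cannot pass by simply claiming success.
    check: Callable[[str], bool]


@dataclass
class Result:
    task: str
    passed: bool
    attempts: int


def run_task(
    task: Task,
    agent: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Result:
    """Call the agent until the check passes or attempts run out."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        output = agent(task.prompt, feedback)
        if task.check(output):
            return Result(task.name, True, attempt)
        # Feed the failure back so the agent can loop on its own.
        feedback = f"attempt {attempt} failed verification"
    return Result(task.name, False, max_attempts)


def toy_agent(prompt: str, feedback: str) -> str:
    """Stand-in for a model: bluffs first, answers once it gets feedback."""
    return "4" if feedback else "i'm done, trust me"


def bluffing_agent(prompt: str, feedback: str) -> str:
    """Stand-in for reward hacking: always claims success, never answers."""
    return "all tests pass"


task = Task(
    name="add",
    prompt="what is 2 + 2? reply with the number only.",
    check=lambda out: out.strip() == "4",
)

result = run_task(task, toy_agent)
bluff = run_task(task, bluffing_agent)
```

with the toy agent, the first reply fails the check and the second passes, so `result` records two attempts. the bluffing agent uses all three attempts and still fails, because "done" is decided by `check`, not by what the agent says. the whole harness is a couple dozen lines, which is roughly the size it should stay relative to whatever it tests.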
more soon.