Douwe Osinga
Software Engineer

Self-Improving Agents Still Need Humans


A human engineer reviews an AI agent feedback loop across benchmark dashboards and terminal logs

Goodhart's law is the benchmarker's curse: when a measure becomes a target, it stops being a good measure. Coding-agent benchmarks are almost designed to trigger it. The tasks are public, the result is a single number, and the leaderboard inevitably fills up with harnesses that are overfit to the benchmark, often unintentionally.

That does not make Terminal-Bench, the go-to standard, useless, but it does change how the goose team uses it. The leaderboard score is a noisy measure of general agent ability. The real signal is the pattern of failures: tasks where goose keeps getting stuck, or where goose fails and another harness succeeds.
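As a rough illustration, here is a minimal sketch of how that kind of failure pattern might be pulled out of per-task results. The `(harness, task, passed)` tuple format, the function name `failure_signals`, and the `min_repeat` threshold are all assumptions made for this example. They are not Terminal-bench's output format or the goose team's actual tooling.

```python
from collections import defaultdict


def failure_signals(runs, target="goose", min_repeat=2):
    """Split the target harness's failures into two rough buckets.

    `runs` is an iterable of (harness, task, passed) tuples, one per attempt.
    This record shape is hypothetical, chosen only for illustration.
    """
    # task -> harness -> list of pass/fail outcomes across attempts
    outcomes = defaultdict(lambda: defaultdict(list))
    for harness, task, passed in runs:
        outcomes[task][harness].append(passed)

    stuck = []  # tasks the target failed on every one of >= min_repeat attempts
    gaps = []   # (task, harnesses) where the target never passed but others did
    for task, by_harness in sorted(outcomes.items()):
        target_runs = by_harness.get(target, [])
        if not target_runs or any(target_runs):
            continue  # target never attempted the task, or passed at least once
        if len(target_runs) >= min_repeat:
            stuck.append(task)
        others = sorted(
            h for h, results in by_harness.items() if h != target and any(results)
        )
        if others:
            gaps.append((task, others))
    return stuck, gaps
```

Neither list says anything about leaderboard position. Both point at specific tasks worth reading the logs for, which is the use of the benchmark described above.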

That is also why we usually benchmark with Sonnet rather than the strongest model available. We are not trying to get the largest possible number. We want enough failures left on the table to see what support the agent is missing.