Self-Improving Agents Still Need Humans

Goodhart's law is the benchmarker's curse: when a measure becomes a target, it stops being a good measure. Coding-agent benchmarks are almost designed to trigger it. The tasks are public, the result is one number, and the leaderboard inevitably fills up with harnesses that are, often without meaning to be, overfit to the benchmark.
That does not make Terminal-bench, the go-to standard, useless, but it does change how the goose team uses it. The leaderboard score is a noisy measure of general agent ability. The real signal is the pattern of failures: places where goose keeps getting stuck, or where goose fails and another harness succeeds.
That is also why we usually benchmark with Sonnet rather than the strongest model available. We are not trying to get the largest possible number. We want enough failures left on the table to see what support the agent is missing.
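To make the failure-pattern reading concrete, here is a minimal sketch of how per-run results could be sorted into those two buckets. It is an illustration, not the goose team's actual tooling: the `(harness, task, passed)` record shape, the function names, the "zero passes means stuck" rule, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict


def pass_rates(results):
    """Map (harness, task) to the fraction of recorded runs that passed."""
    runs = defaultdict(list)
    for harness, task, passed in results:
        runs[(harness, task)].append(bool(passed))
    return {key: sum(outcomes) / len(outcomes) for key, outcomes in runs.items()}


def failure_patterns(results, agent="goose"):
    """Return (stuck, gaps) for `agent`.

    stuck: tasks the agent never passed in any recorded run.
    gaps:  (task, harnesses) pairs where another harness has a higher pass rate.
    Both definitions are assumptions chosen for this sketch.
    """
    rates = pass_rates(results)
    tasks = sorted({task for (harness, task) in rates if harness == agent})
    stuck, gaps = [], []
    for task in tasks:
        own = rates[(agent, task)]
        if own == 0.0:
            stuck.append(task)
        rivals = sorted(
            harness
            for (harness, other_task), rate in rates.items()
            if other_task == task and harness != agent and rate > own
        )
        if rivals:
            gaps.append((task, rivals))
    return stuck, gaps


# Toy, made-up records: two runs per harness per task. Not real benchmark data.
toy = [
    ("goose", "task-a", False), ("goose", "task-a", False),
    ("goose", "task-b", True), ("goose", "task-b", False),
    ("goose", "task-c", True), ("goose", "task-c", True),
    ("other-harness", "task-a", False), ("other-harness", "task-a", False),
    ("other-harness", "task-b", True), ("other-harness", "task-b", True),
    ("other-harness", "task-c", True), ("other-harness", "task-c", False),
]

stuck, gaps = failure_patterns(toy)
print(stuck)  # ['task-a']
print(gaps)   # [('task-b', ['other-harness'])]
```

Neither list is a score. Each entry is a pointer to a transcript worth reading, which is the use of the benchmark described above.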
