DevTools Summary

Your Agents.md Might Be Making AI Worse

An ETH Zurich study tests whether AGENTS.md and CLAUDE.md files actually help coding agents. LLM-generated context files reduce success rates while adding 20%+ to costs. Human-written ones barely help.

Over 60,000 public GitHub repositories now include a context file (AGENTS.md, CLAUDE.md, or similar), following strong recommendations from OpenAI, Anthropic, and others. A new paper from ETH Zurich is the first to rigorously test whether they actually work.

The short answer: LLM-generated context files hurt more than they help. Human-written ones are marginally useful, but only when kept minimal.

  • LLM-generated context files reduced task success rates in 5 out of 8 settings tested
  • Average performance drop: -0.5% on SWE-bench Lite, -2% on AGENTbench
  • Inference cost increase: +20–23% with LLM-generated context files
  • Developer-written context files: +4% average improvement, but also +19% cost
  • Agents do follow the instructions; the problem is that unnecessary requirements make tasks harder

The benchmark

The researchers built AGENTbench, a new benchmark of 138 real GitHub issues from 12 niche Python repositories, all with developer-written context files. They tested four coding agents (Claude Code with Sonnet 4.5, Codex with GPT-5.2 and GPT-5.1 mini, and Qwen Code with Qwen3-30B) across three settings: no context file, an LLM-generated one, and the developer-provided file.

They also ran the same agents on SWE-bench Lite (300 tasks across popular repos) to check if results held outside the niche-repo setting.

Average change in task success rate vs. no context file

SWE-bench Lite
  LLM-generated: -0.5%
  Developer-written: not measured (no developer-written files in SWE-bench)

AGENTbench (niche repos with developer context files)
  LLM-generated: -2%
  Developer-written: +4%

Averages across Claude Code, Codex (GPT-5.2 + GPT-5.1 mini), and Qwen Code. All context files add 19–23% to inference cost regardless of type.

Why they hurt

The mechanism is a bit counterintuitive. Agents follow context file instructions faithfully: when a file mentions uv, agents invoke it about 1.6 times per instance, versus almost never when it isn't mentioned. The problem isn't that agents ignore the files.

The problem is that instructions create unnecessary work. Context files cause agents to run more tests, search more files, and read more files, but they don’t get to the right files any faster. When researchers measured how quickly agents found files relevant to each issue, context files made no meaningful difference. The agents were doing more, not doing better.

LLM-generated context files are also largely redundant with existing documentation. When all docs are removed from the repo, leaving the context file as the only documentation, LLM-generated files actually improve performance by 2.7% and outperform human-written ones. That explains the anecdotal reports of context files helping. They’re most useful in repos with no other documentation, which happen to be exactly the niche, newer repos that standard benchmarks don’t cover.

The recommendation

Skip LLM-generated context files for now, despite what agent developers recommend. The /init commands in Claude Code, Codex, and Qwen Code all produce files that, on average, reduce performance while adding 20%+ to costs.

If you write context files manually, keep them to tooling requirements: uv instead of pip, a custom test runner, that kind of thing. Codebase overviews don’t help agents navigate. They just add noise and trigger more (unnecessary) exploration.
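A context file in that spirit might look like the sketch below. The specific commands and paths here are hypothetical, chosen only to illustrate the "tooling requirements, nothing else" style the findings support:

```markdown
# CLAUDE.md

## Tooling requirements

- Use `uv` for all dependency management; do not call `pip` directly.
- Run tests with `uv run pytest tests/` (there is no Makefile target for tests).
- Format code with `ruff format` before committing.
```

Note what it omits: no architecture overview and no file-by-file tour. Per the study, that kind of codebase narrative triggers extra exploration without helping agents reach the right files any faster.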

The irony isn’t lost: this site has its own CLAUDE.md. The findings suggest it should stay focused on concrete tooling and workflow requirements rather than expansive architectural descriptions.

#devtools #benchmarks #research #coding-agents