DevTools Summary

Your Agents.md Might Be Making AI Worse

An ETH Zurich study tests whether AGENTS.md and CLAUDE.md files actually help coding agents. LLM-generated context files reduce success rates while adding 20%+ to costs. Human-written ones barely help.

Over 60,000 public GitHub repositories now include a context file (AGENTS.md, CLAUDE.md, or similar), following strong recommendations from OpenAI, Anthropic, and others. A new paper from ETH Zurich is the first to rigorously test whether they actually work.

The short answer: LLM-generated context files hurt more than they help. Human-written ones are marginally useful, but only when kept minimal.

  • LLM-generated context files reduced task success rates in 5 out of 8 settings tested
  • Average performance drop: -0.5% on SWE-bench Lite, -2% on AGENTbench
  • Inference cost increase: +20–23% with LLM-generated context files
  • Developer-written context files: +4% average improvement, but also +19% cost
  • Agents do follow the instructions; the problem is that unnecessary requirements make tasks harder

The benchmark

The researchers built AGENTbench, a new benchmark of 138 real GitHub issues from 12 niche Python repositories, all with developer-written context files. They tested four coding agents (Claude Code with Sonnet 4.5, Codex with GPT-5.2 and GPT-5.1 mini, and Qwen Code with Qwen3-30B) across three settings: no context file, an LLM-generated one, and the developer-provided file.

They also ran the same agents on SWE-bench Lite (300 tasks across popular repos) to check if results held outside the niche-repo setting.

Average change in task success rate vs. no context file

SWE-bench Lite
  LLM-generated: -0.5%
  Developer-written: not measured (no developer-written files in SWE-bench)

AGENTbench (niche repos with developer context files)
  LLM-generated: -2%
  Developer-written: +4%

Averages across Claude Code, Codex (GPT-5.2 + GPT-5.1 mini), and Qwen Code. All context files add 19–23% to inference cost regardless of type.

Why they hurt

The mechanism is a bit counterintuitive. Agents follow context file instructions faithfully: when a file mentions uv, agents invoke it about 1.6 times per instance, versus almost never when it isn't mentioned. The problem isn't that agents ignore the files.

The problem is that instructions create unnecessary work. Context files cause agents to run more tests, search more files, and read more files, but they don’t get to the right files any faster. When researchers measured how quickly agents found files relevant to each issue, context files made no meaningful difference. The agents were doing more, not doing better.

LLM-generated context files are also largely redundant with existing documentation. When all docs are removed from the repo, leaving the context file as the only documentation, LLM-generated files actually improve performance by 2.7% and outperform human-written ones. That explains the anecdotal reports of context files helping. They’re most useful in repos with no other documentation, which happen to be exactly the niche, newer repos that standard benchmarks don’t cover.

The recommendation

Skip LLM-generated context files for now, despite what agent developers recommend. The /init commands in Claude Code, Codex, and Qwen Code all produce files that, on average, reduce performance while adding 20%+ to costs.

If you write context files manually, keep them to tooling requirements: uv instead of pip, a custom test runner, that kind of thing. Codebase overviews don’t help agents navigate. They just add noise and trigger more (unnecessary) exploration.
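A context file in that spirit might look like the sketch below. The specific commands and paths here are hypothetical, chosen only to illustrate the "tooling requirements, nothing else" style the findings support:

```markdown
# CLAUDE.md

## Tooling requirements

- Use `uv` for all dependency management; do not call `pip` directly.
- Run tests with `uv run pytest tests/` (there is no Makefile target for tests).
- Format code with `ruff format` before committing.
```

Note what it omits: no architecture overview and no file-by-file tour. Per the study, that kind of codebase narrative triggers extra exploration without helping agents reach the right files any faster.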

The irony isn’t lost: this site has its own CLAUDE.md. The findings suggest it should stay focused on concrete tooling and workflow requirements rather than expansive architectural descriptions.

#devtools #benchmarks #research #coding-agents