AI Analysis

AI-Generated Agent Skills Are Pointless

SkillsBench tests whether structured knowledge packages improve LLM agents across 84 tasks. Curated Skills add 16pp. Self-generated Skills add nothing, or make things worse.

When you give an AI agent curated how-to procedures for a task, it performs dramatically better. When you ask the agent to write its own procedures first, it performs the same or worse. The gap between those two (+16.2 percentage points vs. -1.3pp) is the central finding of SkillsBench, a new benchmark from a 40-person research collaboration testing “Agent Skills” across 84 tasks and 7,308 trajectories.

  • Curated Skills raise average pass rate by +16.2pp across 7 model-harness configurations (Claude Code, Gemini CLI, Codex)
  • Self-generated Skills: -1.3pp on average, near-zero and often negative
  • Domain effects range from Healthcare +51.9pp to Software Engineering +4.5pp
  • 2-3 Skills per task outperform 4+ Skills (+18.6pp vs. +5.9pp)
  • Comprehensive documentation-style Skills: -2.9pp (they hurt)
  • Claude Haiku 4.5 with curated Skills (27.7%) beats Claude Opus 4.5 without Skills (22.0%)

What Agent Skills Are

Agent Skills are structured packages of procedural knowledge: a SKILL.md file plus optional scripts, templates, and worked examples, all injected into an agent’s context at runtime. Think of them as SOPs: “here’s how to analyze clinical lab data,” “here’s the workflow for SEC filing comparison.” Unlike RAG (factual retrieval) or system prompts (unstructured), Skills are procedural, modular, and portable across models.
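The paper doesn't prescribe a layout beyond a SKILL.md file with optional supporting files, but a minimal Skill might look something like this (the headings, the LOINC-mapping step, and the file names are illustrative, not taken from the benchmark):

```markdown
# SKILL: Clinical Lab Data Harmonization

## When to use
The task involves merging lab results from multiple sources with
inconsistent units or analyte names.

## Procedure
1. Map analyte names to a common vocabulary (e.g. LOINC codes)
   before merging.
2. Normalize units (mg/dL vs. mmol/L) using the conversion table
   in `conversions.csv`.
3. Flag out-of-range values for review instead of dropping them.

## Worked example
See `examples/harmonize_demo.py` for an end-to-end run.
```

Note the shape: a trigger condition, a numbered procedure with concrete steps, and a pointer to a worked example rather than exhaustive reference material.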

The ecosystem has grown fast. The paper catalogued 47,150 unique Skills across GitHub, Smithery.ai, and corporate sources, with a surge to over 18,000 daily additions in January 2026. Despite that growth, no benchmark had systematically measured whether they actually help. SkillsBench is the first to try.

Curated vs. Self-Generated: The Central Finding

The benchmark tests each of 84 tasks under three conditions: no Skills, curated Skills (human-authored), and self-generated Skills (the agent writes its own procedural docs before solving).

Average Pass Rate by Condition (84 tasks, 7 model-harness configurations)

Curated Skills           40.6%
No Skills (baseline)     24.3%
Self-Generated Skills    21.0%

Self-generated condition evaluated on Claude Code (all 4 models) and Codex only. Gemini CLI not supported.

Curated Skills work. Self-generated Skills don’t.

The failure mode isn’t that models refuse to generate Skills. They do generate them. But they produce imprecise or incomplete procedures: “use pandas for data processing” without specific API patterns, or generic scaffolding that restates what the task already says. For high-domain-knowledge tasks like manufacturing workflows or clinical data harmonization, models often don’t recognize they need specialized procedures at all and just attempt solutions with general-purpose approaches.
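To make that contrast concrete, here is the kind of vague guidance the paper describes self-generated Skills producing, next to the specific form a curated Skill takes (both snippets are hypothetical illustrations):

```markdown
<!-- Self-generated: restates the obvious, no API specifics -->
- Use pandas for data processing.
- Clean the data before analysis.

<!-- Curated: names the exact pattern and the pitfall -->
- Parse dates with pd.to_datetime(df["date"], format="%Y-%m-%d",
  errors="coerce"), then drop NaT rows; malformed dates silently
  left as strings are the most common failure here.
```

The first version adds no information the agent didn't already have; the second encodes a decision the agent would otherwise have to get right on its own.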

If you were hoping to automate Skills authoring, that’s the bad news. The procedural knowledge that most helps agents is exactly the kind models can’t reliably reconstruct from their weights. You have to write it yourself.

Domain Effects Are Everything

The average +16.2pp hides enormous variance. Healthcare gains +51.9pp and Manufacturing +41.9pp. Software Engineering, where models have the deepest pretraining coverage, gains only +4.5pp.

Skills Improvement by Domain (percentage points, averaged across all configurations)

Healthcare               +51.9pp
Manufacturing            +41.9pp
Cybersecurity            +23.2pp
Natural Science          +21.9pp
Energy                   +17.9pp
Office & White Collar    +17.8pp
Finance                  +15.1pp
Media & Content          +13.9pp
Robotics                  +7.0pp
Mathematics               +6.0pp
Software Engineering      +4.5pp

The pattern is consistent: Skills matter most where required procedural knowledge is underrepresented in model pretraining. Clinical data harmonization workflows, manufacturing defect codebooks, SEC filing analysis procedures: models haven’t seen enough of these in training. Git workflows and standard coding patterns are another story.

If you’re building agents for specialized professional domains, curated Skills are probably the highest-leverage improvement available. A few well-written SKILL.md files can beat a model upgrade for niche procedural tasks.

One caveat worth noting: 16 of 84 tasks showed negative deltas even with curated Skills. Skills can hurt when they introduce conflicting guidance or unnecessary complexity on tasks models already handle well.

Shorter, Focused Skills Beat Comprehensive Ones

Two design results, both counterintuitive.

First, 2-3 Skills per task outperform 4+ Skills: the gain with 2-3 Skills is +18.6pp, while with 4+ it drops to +5.9pp. More context doesn't compound; it competes.

Second, comprehensive documentation-style Skills actually hurt performance (-2.9pp) while focused, detailed Skills help the most (+18.8pp). Agents struggle to extract relevant guidance from lengthy Skills content; the context window fills without providing actionable direction.

Skills Complexity       Improvement
Detailed (focused)      +18.8pp
Compact                 +17.1pp
Standard                +10.1pp
Comprehensive            -2.9pp

Write Skills like a good runbook, not a wiki. Two focused guides with working examples beat one exhaustive document. The instinct to be thorough works against you here.

Takeaways

Don’t auto-generate Skills. Asking the agent to write its own procedures before solving isn’t a shortcut. It’s near-zero improvement at best, often a net negative. The self-generated Skills result is the most useful finding in the paper, because it’s the most tempting thing to try.

Target niche domains first. Skills provide the most lift where model pretraining is thinnest: healthcare, manufacturing, specialized finance workflows. Spending time writing Skills for software engineering tasks is probably not the best use of that effort.

Keep Skills short and specific. 2-3 focused modules with working examples. Comprehensive documentation eats context budget without proportionate benefit. If you catch yourself writing a thorough reference guide, you’ve gone too far.

The benchmark covers 11 domains, 84 tasks, and 7,308 trajectories with deterministic verifiers. That paired testing methodology (with vs. without Skills on the same tasks) is something other agent benchmarks should adopt.
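The paired design can be sketched in a few lines: score each task under both conditions and average the per-task deltas, so every task contributes one difference rather than two independent scores. The task names and pass rates below are made up for illustration:

```python
# Paired comparison: the same tasks are evaluated with and without
# Skills, so each task yields one delta. Rates are illustrative.
tasks = {
    "clinical_harmonization": {"baseline": 0.10, "curated": 0.65},
    "sec_filing_diff":        {"baseline": 0.30, "curated": 0.45},
    "git_rebase_cleanup":     {"baseline": 0.70, "curated": 0.72},
}

def mean_delta_pp(results: dict) -> float:
    """Average per-task improvement in percentage points."""
    deltas = [100 * (r["curated"] - r["baseline"]) for r in results.values()]
    return sum(deltas) / len(deltas)

print(round(mean_delta_pp(tasks), 1))  # → 24.0
```

Because the comparison is within-task, variance in task difficulty cancels out, which is what makes the per-domain breakdowns above meaningful.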

#agents #benchmarks #research #llm