AI Analysis

The truth about AI and skill retention

A randomized trial found that developers using AI assistance scored 17% lower on a skills test without gaining any speed advantage. The finding matters, but the study design limits how far you can take it.

A randomized experiment from Anthropic researchers found that developers who used AI assistance to learn a new coding library scored 17% worse on a follow-up skills test than those who worked without it. They also didn’t complete the task any faster.

That’s the headline. The details are worth understanding before you accept it wholesale.

  • AI-assisted group scored 17% lower on a post-task quiz (4.15 points on a 27-point scale, Cohen’s d=0.738, p=0.01)
  • No significant speedup from AI (median: ~19.5 min with AI vs ~23 min without, not statistically significant)
  • Debugging questions showed the largest performance gap between groups
  • n=52 total; 26 per condition
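
Those headline numbers are internally consistent: Cohen’s d is the mean difference divided by the pooled standard deviation, so the reported gap and effect size together imply a pooled SD of roughly 5.6 quiz points. A quick back-of-envelope check:

```python
# Sanity-check on the reported effect size (numbers from the study).
mean_gap = 4.15   # quiz-score gap between groups, on a 27-point scale
cohens_d = 0.738  # reported effect size

# d = mean difference / pooled SD, so the implied pooled SD is:
implied_pooled_sd = mean_gap / cohens_d
print(round(implied_pooled_sd, 2))  # 5.62
```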

What they found

Researchers Judy Shen and Alex Tamkin ran 52 professional Python developers through a coding task using Trio, a niche async library none of them had used before. Half had access to a GPT-4o chat assistant. Half had only web search and documentation. Everyone had 35 minutes.

The AI assistant could generate complete correct code for both tasks when asked directly. The researchers expected the AI group to finish faster. Most didn’t. Some participants spent up to 11 minutes just composing queries, and a handful asked more than 15 questions. At that level of back-and-forth, the overhead erases the speed advantage.

After the coding session, everyone took a 27-point quiz on Trio concepts: async/await, error handling, nurseries, memory channels. No AI allowed. The AI group averaged 4.15 points lower.

The skill area with the biggest gap was debugging. The control group hit roughly three times as many errors during the task (median: 3 vs 1) and had to resolve them without assistance. Those errors were the lesson. The AI group sailed past them and came out the other side not knowing what they’d missed.

How you use it matters

The researchers reviewed screen recordings for 51 of the participants and categorized the AI users into six behavior patterns.

Six AI interaction patterns and average quiz outcomes

Low-scoring patterns (under 40% quiz score)
  • AI Delegation (n=4) — fully handed off all coding to AI: ~30%
  • Progressive AI Reliance (n=4) — started independently, then delegated: ~35%
  • Iterative AI Debugging (n=4) — used AI to check and fix without understanding: ~39%

High-scoring patterns (65-86% quiz score)
  • Generation-Then-Comprehension (n=2) — generated code, then asked follow-up questions: ~78%
  • Hybrid Code-Explanation (n=3) — asked for code with explanations attached: ~72%
  • Conceptual Inquiry (n=7) — asked only conceptual questions, wrote code themselves: ~86%

Approximate quiz scores from qualitative analysis of screen recordings. Subgroup sizes are small; treat these as directional.

The pattern is clear in retrospect. When people stayed cognitively engaged, scores held up even with AI assistance. When they offloaded thinking, scores dropped. The tool itself wasn’t the variable. Mental engagement was.

One other finding worth noting: participants who manually typed AI-generated code rather than pasting it were slower but not significantly better at the quiz afterward. Time on task doesn’t preserve learning. Thinking does.

Before you take this at face value

The core finding is real and worth taking seriously. But three things limit how far you should extrapolate.

The sample is small, and the subgroup analysis is tiny. 52 participants total. Some of the six AI behavior pattern clusters have as few as 2 people. The “Generation-Then-Comprehension” high-scoring group is built on exactly 2 participants. Those quiz score numbers in the chart above are directionally interesting and statistically meaningless.

The quiz didn’t test the most AI-assisted skill. The researchers explicitly excluded code-writing questions from the evaluation to avoid penalizing participants for syntax errors. Reasonable. But it also means the test measured things the no-AI group practiced more: conceptual recall, reading unfamiliar code, debugging without help. The one area where AI genuinely helps, producing correct code quickly, wasn’t on the exam. The finding isn’t wrong, but the test was structured in a way that favors the control group.

And this isn’t really “skill formation” in any meaningful sense. The paper is titled “How AI Impacts Skill Formation” but what they measured is how much people absorbed from a 35-minute coding tutorial, tested immediately afterward. There’s no follow-up a week or month later. Real skill development happens through months of repeated exposure. What the study actually shows is: if you let AI complete a short exercise for you, you understand less of what just happened. That’s true. It’s also not surprising.

None of this makes the finding irrelevant. The direction is probably right, and the debugging gap in particular is worth sitting with. If AI handles the messy error-resolution work during the learning phase, you may not build the debugging intuition you’d need to supervise AI-written code later. That’s the real concern, and it holds even with the methodological caveats attached.

#ai #research #learning #productivity