When AI Helps and When It Hurts: A Critical Look at Anthropic's Skill Formation Study
Anthropic's recent study found AI users scored lower on a coding quiz. Here's what the research actually shows, where the methodology falls short, and how to use AI without sacrificing learning.
Anthropic recently published a randomized controlled trial on how AI assistance affects the formation of coding skills. The headline result: developers who used AI to complete tasks with a new Python library scored 17 percentage points lower on a post-task quiz (50% vs 67%) than those who coded by hand. They also finished about two minutes faster. The question worth asking is whether AI is making developers "dumber," or whether the study design is driving that conclusion.
What Anthropic Actually Did
The study recruited 52 developers, mostly junior, with at least a year of Python "experience" and some familiarity with AI coding tools. None had used the Trio library before. Participants were randomized into two groups: one could use an AI assistant during the main task, the other could not. Both groups had 35 minutes to complete two Trio-based coding tasks, were told they would be quizzed afterward, and were encouraged to work as fast as possible.
The AI group scored lower on the quiz. The largest gap appeared on debugging questions. That is not incidental. Debugging is the skill that lets you validate AI output, trace failure modes, and own systems you did not write. If AI use erodes debugging ability, we are training a generation that cannot supervise the tools they deploy. The paper explicitly frames this around "scalable oversight": humans need to be able to catch when AI-generated code is wrong.
Importantly, how you use AI mattered. High scorers (65% or above) asked conceptual questions, requested explanations, or used a "generation-then-comprehension" approach. Low scorers (under 40%) delegated coding to AI or relied on it for debugging without understanding. The takeaway: not all AI reliance is the same.
Caveats
Before drawing strong conclusions, a few design choices are worth flagging.
Time pressure. Thirty-five minutes for two tasks, with an impending quiz, creates an incentive to optimize for finishing over understanding. Rational developers in that setup would prioritize completion. In practice, sprint culture and deadlines create similar incentives, so the finding may generalize. But the artificial cap may have amplified the effect.
Quiz scores. Both groups scored low. The no-AI group averaged 67%, the AI group 50%. Neither showed strong mastery. The difference is real, but neither outcome suggests deep understanding.
Sample size. Fifty-two participants total, 26 per arm. Small for broad generalization. The qualitative analysis grouped participants into six interaction patterns, but some clusters were tiny: "Generation-then-comprehension" had n=2, "Hybrid code-explanation" had n=3. Drawing firm conclusions from two people is a stretch. The interaction-pattern findings are suggestive, not conclusive.
Junior skew. The cohort was mostly junior developers. Experienced engineers approach coding very differently and may use AI differently. The study does not tell us how senior engineers would fare under the same conditions.
New tool overhead. Participants used a web-based AI chat embedded in a coding platform. That setup looks more like how AI was used two years ago than how developers work today. Participants were also learning the tool while under time pressure, which adds another variable separating the experiment from contemporary practice.
Two minutes is meaningful. The AI group finished about two minutes faster. On a 20 to 23 minute task, that is roughly 8 to 10% faster. The study reported the productivity gain as not statistically significant, but that figure may understate real-world impact. The effect could also be larger depending on where in the task those two minutes were saved.
Error Exposure as Pedagogy
The control group encountered more errors (median 3 vs 1) and resolved them independently. Many were Trio-specific: RuntimeWarning when a coroutine was never awaited, TypeError when a coroutine object was passed where an async function was expected. Fixing those forces engagement with how the library actually works. The AI group largely avoided those errors, which may have short-circuited learning. Getting stuck and working through it is often where real understanding forms. The study design may have inadvertently penalized the AI group by giving them fewer chances to fail and recover.
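For readers less familiar with Trio, here is a minimal sketch of those two error classes. It is illustrative only; fetch_data and main are hypothetical stand-ins, not the study's actual tasks.

```python
import trio

async def fetch_data():
    await trio.sleep(1)  # stand-in for real async work
    return "done"

async def main():
    # Mistake 1: calling an async function without awaiting it.
    # This creates a coroutine object that never runs and triggers
    # "RuntimeWarning: coroutine 'fetch_data' was never awaited".
    fetch_data()

    # Correct: await the coroutine inside an async context.
    result = await fetch_data()
    print(result)

# Mistake 2: passing a coroutine object instead of the async function.
# trio.run(main()) raises a TypeError, because trio.run expects the
# async function itself (it calls it for you), not an already-created
# coroutine object. Correct usage:
trio.run(main)
```

Hitting either error forces you to learn the distinction between an async function and the coroutine object it returns, which is exactly the kind of engagement the AI group mostly skipped.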
The study also compared participants who pasted AI-generated code directly versus those who manually typed it. There was no notable difference in quiz scores. Pasters (n=9) finished fastest; manual typers (n=9) learned no better. Cognitive effort matters more than physical effort. Manually retyping AI output does not seem to improve understanding; asking conceptual questions does. Who manually types instead of copying and pasting? Given this data, and other studies showing that AI-enhanced capabilities often do not persist once the tool is removed, why would anyone retype AI output when doing so does not improve retention or understanding?
The 80% vs 0% Productivity Split
Anthropic's prior observational work found AI can speed up some tasks by 80%. This study found no statistically significant productivity gain. The key difference: prior work measured tasks where participants already had the relevant skills; this study measured learning a new library. The emerging model: AI accelerates you when the domain is familiar, but can hinder you when you are learning. That distinction should inform when and how teams use AI, for example routine implementation versus onboarding to a new framework.
The paper also notes that the setup used chat-based help, not agentic coding products like Claude Code or Cursor. Tools where the AI writes directly into the codebase may create even stronger cognitive offloading. Worth considering when rolling out agentic tools to junior-heavy teams.
The Scalable Developer Counterargument
Some developers argue that AI does not make them dumber; it makes them more scalable. They offload low-level syntax so they can focus on architecture and problem-solving. The study measured immediate comprehension of a new library, not long-term architecture skills. As developers gain experience with AI workflows (prompting, reviewing, integrating), efficiency may compound. The research is specifically about learning new skills, not about applying existing ones.
Practical Takeaways
What does this mean for teams?
First, do not delegate understanding when learning something new. Ask for explanations. Pose conceptual questions. Use AI to deepen comprehension, not to bypass it. This mirrors how effective code review works: you do not just merge; you ask why.
Second, learning modes exist for a reason. Claude Code Learning, ChatGPT Study Mode, and similar features are designed to foster understanding. Use them when the goal is skill acquisition, not just throughput.
Third, managers should design for learning, not just delivery. Especially for junior developers, time pressure plus AI access may optimize for completion at the cost of long-term skill. When the system rewards completion over comprehension, AI becomes a shortcut that trades depth for speed.
Fourth, the research is valuable even with its caveats. The design may have amplified the negative effect, but the signal is real: cognitive engagement matters. Participants in the AI group left feedback like "I got lazy" and "there are still a lot of gaps in my understanding." Many engineers will recognize that feeling: I shipped it, but I do not fully own it.
The Bottom Line
AI can help and hinder. The question is when and how. This study suggests that when you are learning something new, treating the AI as a tutor rather than a substitute preserves understanding. When you already know the domain, AI may deliver the 80% speedup that prior research observed. The distinction matters for how we adopt these tools, how we onboard new engineers, and how we design systems that reward both productivity and craft.
Anthropic's article and the full paper are worth reading in full.