AI Coding Agents Reward Domain Expertise, Not Coding Skill: Anthropic Study of 400K Sessions

23 hours ago / About a 32-minute read

Source: TechTimes

The question hovering over every professional who has watched a colleague build something with an AI coding agent, and wondered whether that was now their job too, finally has a large-scale empirical answer. Anthropic published research on June 16 showing that what determines whether a person succeeds with an AI coding agent is not whether they can write code but whether they deeply understand the problem they are trying to solve. The study analyzed roughly 400,000 Claude Code sessions from approximately 235,000 users between October 2025 and April 2026, drawing on the largest known dataset on how knowledge workers actually direct AI coding agents. Its central finding challenges a two-year-old assumption about who these tools are really for.

The study was authored by Zoe Hitzig, Maxim Massenkoff, Eva Lyubich, Shaoyi Zhang, Ryan Heller, and Peter McCrory.

Lawyers Beat the Software Comparison You Expected

On coding tasks, every one of the ten largest occupation groups in the dataset succeeded at a rate within seven percentage points of software engineers. Management occupations ranked highest on verified success — slightly above software engineers — and the gap between software professionals and all other professions, at roughly five percentage points on the strictest success measure, has neither widened nor narrowed over seven months. Lawyers, analysts, life scientists, and arts and media professionals all landed within that narrow band.

The study's own caution is worth noting: its verified-success measure relies partly on explicit confirmation in the transcript, and the research team acknowledged that managers may be more likely to signal when they got what they wanted. That potential measurement artifact does not fully explain the pattern, but readers evaluating the finding should keep it in mind.

How the Expertise Classifier Works

The study's core mechanism is a privacy-preserving expertise classifier — built on Claude Sonnet 4.6 — that rated each session's user on a five-point scale from novice to expert by reading the session transcript. The classifier did not look at job titles. Instead, it looked for three signals in how the user communicated: how precisely they framed their directions to the agent, what they asked the agent to verify, and whether the user tended to correct Claude or Claude tended to correct the user.

That design makes the expertise measure task-specific rather than credential-specific. A senior software engineer asking their first question about Rust registers as a novice for that session. An accountant who tells the agent exactly which reconciliation rules a Python script must enforce — and then catches the edge case the agent mishandles at month-end close — registers as an expert, regardless of whether they have ever touched Python before. The expertise being measured is knowledge of the problem, not knowledge of the tool.

Every session was classified into one of nine work modes, among them building new software, fixing broken code, testing and orchestrating pipelines, operating and deploying software, planning, exploring codebases, analyzing data, and writing non-code documents. No individual researcher read any transcript; all classification ran through the privacy-preserving pipeline. Agreement between the transcript classifier and independent telemetry (such as whether code lines were actually added or deleted) exceeded 90%.

The Expert Multiplier in Practice

The expertise signal drives a striking gap in how much work the agent does per human instruction. Expert users trigger an average of 12 Claude actions and approximately 3,200 words of output per prompt. Novice users trigger an average of 5 actions and roughly 600 words — a 2.4x gap in actions and a more than 5x gap in output per unit of user input.
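The multipliers quoted above follow directly from the per-prompt averages. The short Python sketch below reproduces the arithmetic; the dictionary and function names are illustrative, and the inputs are the rounded figures reported in this article, not raw study data.

```python
# Per-prompt averages as reported in the article (rounded, not raw study data).
EXPERT = {"actions": 12, "words": 3200}
NOVICE = {"actions": 5, "words": 600}


def multiplier(expert: float, novice: float) -> float:
    """Return how many times larger the expert figure is than the novice one."""
    return expert / novice


action_gap = multiplier(EXPERT["actions"], NOVICE["actions"])
word_gap = multiplier(EXPERT["words"], NOVICE["words"])

print(f"actions per prompt: {action_gap:.1f}x")  # 12 / 5
print(f"words per prompt:   {word_gap:.1f}x")    # 3200 / 600
```

The action ratio works out to 2.4x and the word ratio to about 5.3x, which matches the "more than 5x" description.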

That multiplier holds across every category of work and every band of estimated task value. It appears to reflect that expert users provide clearer context, make better decomposition decisions upfront, and ask the agent to verify the things that actually matter. The agent responds by doing more, and doing more correctly.

Success rates show the same gradient. The study defines verified success as a combination of the classifier judging the session successful and independent telemetry confirming it with hard evidence, such as passing tests, git commits, or explicit user confirmation. Novice sessions reached verified success 15% of the time; intermediate and expert sessions did so between 28% and 33% of the time. The study found that most of the gain is concentrated in the novice-to-intermediate transition; the additional step from intermediate to expert produces a real but smaller improvement.
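One way to read the verified-success definition is as a logical AND: the classifier must judge the session successful, and at least one piece of hard evidence must back it up. The Python sketch below illustrates that reading only. The `Session` fields and the toy data are hypothetical, and the study's actual pipeline is not described at this level of detail.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Session:
    """Hypothetical record of one session; field names are illustrative."""
    classifier_success: bool       # transcript classifier judged the session successful
    tests_passed: bool = False     # hard evidence: tests passed
    git_commit: bool = False       # hard evidence: a git commit was made
    user_confirmed: bool = False   # hard evidence: explicit user confirmation


def is_verified_success(s: Session) -> bool:
    """Success counts only when the classifier AND some hard evidence agree."""
    hard_evidence = s.tests_passed or s.git_commit or s.user_confirmed
    return s.classifier_success and hard_evidence


def verified_success_rate(sessions: Iterable[Session]) -> float:
    """Share of sessions that meet the verified-success bar (0.0 if empty)."""
    sessions = list(sessions)
    if not sessions:
        return 0.0
    return sum(is_verified_success(s) for s in sessions) / len(sessions)


# Toy data, invented purely for illustration.
toy = [
    Session(classifier_success=True, tests_passed=True),   # verified
    Session(classifier_success=True),                      # no hard evidence
    Session(classifier_success=False, git_commit=True),    # classifier says no
    Session(classifier_success=False),                     # neither
]
print(verified_success_rate(toy))
```

Under this reading, a session the classifier liked but that left no hard evidence does not count, which is why the verified rate is stricter than the classifier's judgment alone.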

When sessions ran into trouble — meaning the classifier recorded verified evidence of failure such as errors, failed tests, or retries — the patterns diverged further. Among sessions that hit trouble, expert users recovered and still reached success at a rate of 15%; novice users reached success in only 4% of those struggling sessions. Novice users also abandoned failed sessions at a rate of 19%, compared to 5–7% for everyone else.

What the Division of Labor Actually Looks Like

Across all sessions, users made roughly 70% of planning decisions — what to build, which approach to take, what counts as done — while Claude handled roughly 80% of execution decisions — which files to change, what code to write, which commands to run. In a typical session, users and Claude exchanged about four turns, with each user prompt setting off a chain of around 10 Claude actions and 2,400 words of output. At the upper tail, approximately 2% of sessions averaged more than 100 Claude actions per prompt.

This structural split maps directly onto what the study found about expertise. The human half of the collaboration — deciding what to build and judging whether it succeeded — is where domain knowledge lives. The agent half — implementation, file management, command execution — is where coding skill previously lived. The practical implication is that the scarce input to a successful session is not someone who can write Python. It is someone who can specify what correct Python output looks like.

The Shift in What People Are Using Claude Code For

The seven-month observation window also documented a meaningful change in session composition. The share of sessions spent fixing broken code fell from 33% in October 2025 to 19% by April 2026. In its place, operating software — deploying, configuring, and running systems — grew from 14% to 21% of sessions. Writing and data analysis roughly doubled, from about 10% to 20% of sessions.
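Share shifts like these can be read either as percentage-point changes or as relative changes, and the two can look quite different. The Python sketch below computes both from the figures quoted above; the names are illustrative, and the writing and data analysis shares are approximate in the source.

```python
# (start share %, end share %) as quoted in the article, October 2025 to April 2026.
SHARES = {
    "fixing broken code": (33, 19),
    "operating software": (14, 21),
    "writing and data analysis": (10, 20),  # the article says "about"
}


def point_change(start: int, end: int) -> int:
    """Change in share, in percentage points."""
    return end - start


def relative_change(start: int, end: int) -> float:
    """Change in share relative to the starting share, in percent."""
    return (end - start) / start * 100


for mode, (start, end) in SHARES.items():
    print(
        f"{mode}: {point_change(start, end):+d} points, "
        f"{relative_change(start, end):+.0f}% relative"
    )
```

Debugging fell 14 points (roughly 42% in relative terms), operating software rose 7 points (50%), and writing and data analysis rose about 10 points (100%, consistent with "roughly doubled").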

The estimated economic value of the average session rose 27% between October and April, with values calibrated against rates in public freelance marketplace postings. The researchers explicitly note that these estimates are tools for relative comparison rather than literal dollar figures, and that the underlying valuation approach matches sessions to freelance postings at a level of granularity that introduces measurement noise. The direction of the trend, with sessions shifting toward higher-complexity, non-debugging work over time, is more reliable than the specific percentage.

Why the Expertise Gap Is Likely to Persist

The expertise advantage documented in this study is not simply about familiarity with the AI tool. The classifier was specifically designed to measure task-specific domain knowledge — the kind that accumulates through years of working in a field — not general AI proficiency. An accountant who has spent a decade catching reconciliation errors brings something to a Claude Code session that a general-purpose programmer without accounting experience cannot replicate by becoming more comfortable with the interface.

This matters for workforce planning in a specific way. Research in cognitive science on expertise development finds that deep domain knowledge is primarily built through accumulated, feedback-rich experience with a field's actual problems — not through tool access or instruction alone. If the Anthropic study's pattern reflects that broader principle, then giving a larger population access to AI coding agents will not eliminate the expertise gap; it will simply reveal it more clearly. Workers who already have deep domain knowledge will see their ability to get work done expand significantly. Workers without it will find the tools less useful, regardless of how comfortable they become with the AI interface.

As the study itself concludes: "the ability to steer Claude toward success comes more from command of a domain than from the ability to write code. A person with such command, in any field, may now be able to do technical work they previously could not. A person without any such expertise will get far less from the same tool."

Do You Need to Code to Use AI Coding Agents?

The Anthropic data suggests the answer is no, with an important caveat. The occupational gap between software professionals and all other fields in the study is five percentage points on strict verified-success measures. That is real but far smaller than conventional intuition would predict.

The caveat: the study covers Claude Code used through Anthropic's own CLI, claude.ai, and the Claude Code desktop app. It explicitly excludes sessions running through third-party developer environments such as Cursor and VS Code integrations, headless automated sessions, and programmatic use through the Agent SDK. The population of users in the study therefore skews toward people who chose to work through Anthropic's own interfaces. Because of that selection effect, the findings may not fully generalize to all AI coding agent deployments.

The study's figures are also vendor-stated and rely on classifier-based success measurement rather than observation of real-world outcomes. The researchers themselves note that they cannot measure whether code written in a session is actually used or produces an economically valuable artifact. Independent replication of the study's methodology had not been published as of June 2026.

For knowledge workers in non-technical fields, the practical takeaway from the data is that the barrier to using AI coding agents productively is lower than assumed — but not zero. Getting to the intermediate expertise level, where the study's data shows most of the benefit is captured, requires understanding your problem domain well, being able to specify what correct output looks like, and knowing when to correct the agent and when to trust it.

Frequently Asked Questions

Does a coding background help when using AI coding agents?

Yes, but less than previously assumed. The Anthropic study found that software engineers reached verified success in about 34% of code-producing sessions, compared to about 29% for non-software professionals — a five-point gap that held steady over the seven-month observation window and never widened. Every one of the ten largest occupation groups in the dataset landed within seven percentage points of software engineers on that measure.

What does the Anthropic Claude Code study say about management professionals?

Management occupations ranked highest on verified success in the study, slightly above software engineers. The Anthropic team noted this may reflect skills that transfer directly to directing an agent — translating goals into requirements, evaluating whether outputs solve the right problem, and managing iterative refinement — or it may partly reflect a measurement artifact, since managers may be more likely to explicitly confirm in the session transcript when they received what they asked for.

Will improving AI coding agents eventually close the expertise gap?

The study's authors explicitly flag this as a key question to monitor. If the returns to domain expertise begin to decrease over time — meaning novice and expert success rates converge — that would suggest AI agents are starting to supply the judgment that users currently bring. As of the April 2026 endpoint of the study, that convergence had not occurred: the gap between novice and intermediate users was still large, and it had not narrowed over seven months of model improvements.

Who is growing fastest among non-software Claude Code users?

According to the Anthropic study, the fastest-growing non-software occupation groups in the dataset during the October 2025–April 2026 window were management, sales, and legal occupations. The study infers occupation from session context signals — file names, document types referenced, vocabulary used — rather than from self-reported information.
