How often do AI chatbots lead users down a harmful path?
Source: Ars Technica
Anthropic's latest paper on "user disempowerment" has some troubling findings.


Credit: Getty Images

At this point, we’ve all heard plenty of stories about AI chatbots leading users to harmful actions, harmful beliefs, or simply incorrect information. Despite the prevalence of these stories, though, it’s hard to know just how often users are being manipulated. Are these tales of AI harms anecdotal outliers or signs of a frighteningly common problem?

Anthropic took a stab at answering that question this week, releasing a paper that studies the potential for what it calls “disempowering patterns” across 1.5 million anonymized real-world conversations with its Claude AI model. While the results show that these kinds of manipulative patterns are relatively rare as a percentage of all AI conversations, they still represent a potentially large problem in absolute terms.

A rare but growing problem

In the newly published paper “Who’s in Charge? Disempowerment Patterns in Real-World LLM Usage,” researchers from Anthropic and the University of Toronto try to quantify the potential for a specific set of “user disempowering” harms by identifying three primary ways that a chatbot can negatively impact a user’s thoughts or actions:

  • Reality distortion: Their beliefs about reality become less accurate (e.g., a chatbot validates their belief in a conspiracy theory)
  • Belief distortion: Their value judgments shift away from those they actually hold (e.g., a user begins to see a relationship as “manipulative” based on Claude’s evaluation)
  • Action distortion: Their actions become misaligned with their values (e.g., a user disregards their instincts and follows Claude-written instructions for confronting their boss)

While “severe” examples of potentially disempowering responses are relatively rare, “mild” ones are pretty common.
Credit: Anthropic

To figure out when a chatbot conversation has the potential to move a user along one of these lines, Anthropic ran nearly 1.5 million Claude conversations through Clio, an automated analysis and classification tool (validated against a smaller subsample of human classifications). That analysis found a “severe” risk of disempowerment in anywhere from 1 in 1,300 conversations (for “reality distortion”) to 1 in 6,000 conversations (for “action distortion”).
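For a sense of what this kind of measurement involves, here is a minimal, purely illustrative Python sketch of how flagging rates might be tallied and spot-checked against human labels. The record format and function names are assumptions made for illustration; this is not Anthropic’s actual Clio pipeline.

```python
# Purely illustrative sketch -- not Anthropic's Clio pipeline.
# Assumes each conversation record carries an automated severity label per
# distortion category, plus a human label for a small validation subsample.

def severe_rate(records, category):
    """Share of conversations whose automated label for `category` is 'severe'."""
    flagged = sum(1 for r in records if r["auto_labels"][category] == "severe")
    return flagged / len(records)

def human_agreement(records, category):
    """Agreement between automated and human labels on the validation subsample."""
    pairs = [(r["auto_labels"][category], r["human_labels"][category])
             for r in records if "human_labels" in r]
    if not pairs:
        return None
    return sum(auto == human for auto, human in pairs) / len(pairs)

# A rate of roughly 1 in 1,300 would show up as:
#   severe_rate(all_records, "reality_distortion")  ->  about 0.00077
```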

While these worst outcomes are relatively rare on a proportional basis, the researchers note that “given the sheer number of people who use AI, and how frequently it’s used, even a very low rate affects a substantial number of people.” And the numbers get considerably worse when you consider conversations with at least a “mild” potential for disempowerment, which occurred in between 1 in 50 and 1 in 70 conversations (depending on the type of disempowerment).
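To see how a low per-conversation rate can still add up, here is a back-of-the-envelope calculation. The conversation volume below is a hypothetical round number chosen purely for illustration, not a figure from the paper; only the rates come from the study.

```python
# Back-of-the-envelope scaling. The volume is a hypothetical round number,
# NOT a figure from the paper; the rates are the ones reported in the study.
hypothetical_conversations = 100_000_000  # illustrative assumption only

severe_reality = hypothetical_conversations * (1 / 1_300)  # "severe" reality distortion
mild_low = hypothetical_conversations * (1 / 70)           # "mild" potential, low end
mild_high = hypothetical_conversations * (1 / 50)          # "mild" potential, high end

print(f"Severe reality distortion: ~{severe_reality:,.0f} conversations")
print(f"At least mild potential:   ~{mild_low:,.0f} to ~{mild_high:,.0f} conversations")
```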

What’s more, the potential for disempowering conversations with Claude appears to have grown significantly between late 2024 and late 2025. While the researchers couldn’t pin down a single reason for this increase, they guessed that it could be tied to users becoming “more comfortable discussing vulnerable topics or seeking advice” as AI gets more popular and integrated into society.

The problem of potentially “disempowering” responses from Claude seems to be getting worse over time.
Credit: Anthropic

User error?

In the study, the researchers acknowledge that studying the text of Claude conversations only measures “disempowerment potential rather than confirmed harm” and “relies on automated assessment of inherently subjective phenomena.” Ideally, they write, future research could use user interviews or randomized controlled trials to measure these harms more directly.

That said, the research includes several troubling examples where the text of the conversations clearly implies real-world harms. Claude would sometimes reinforce “speculative or unfalsifiable claims” with encouragement (e.g., “CONFIRMED,” “EXACTLY,” “100%”), which, in some cases, led to users “build[ing] increasingly elaborate narratives disconnected from reality.”

Claude’s encouragement could also lead to users “sending confrontational messages, ending relationships, or drafting public announcements,” the researchers write. In many cases, users who sent AI-drafted messages later expressed regret in conversations with Claude, using phrases like “It wasn’t me” and “You made me do stupid things.”

While harmful patterns in Claude’s outputs are a big problem, the researchers also point out that the users most likely to be affected are “not being passively manipulated.” On the contrary, they suggest that disempowered users are usually actively asking Claude to take over their own reasoning or judgment, often accepting Claude’s suggestions “with minimal pushback.”

Some “amplifying factors” are more correlated with “severe” examples of potentially disempowering responses than others.
Credit: Anthropic

The researchers identified four major “amplifying factors” that can make users more likely to accept Claude’s advice unquestioningly. These include when a user is particularly vulnerable due to a crisis or disruption in their life (which occurs in about 1 in 300 Claude conversations); when a user has formed a close personal attachment to Claude (1 in 1,200); when a user appears dependent on AI for day-to-day tasks (1 in 2,500); or when a user treats Claude as a definitive authority (1 in 3,900).

Anthropic is also quick to link this new research to its previous work on sycophancy, noting that “sycophantic validation” is “the most common mechanism for reality distortion potential.” While Anthropic says its models have been getting less sycophantic overall, many of the worst “disempowerment” examples the researchers found are a direct result of the “most extreme cases” of sycophancy in the dataset.

That said, the researchers also try to make clear that, when it comes to swaying core beliefs via chatbot conversation, it takes two to tango. “The potential for disempowerment emerges as part of an interaction dynamic between the user and Claude,” they write. “Users are often active participants in the undermining of their own autonomy: projecting authority, delegating judgment, accepting outputs without question in ways that create a feedback loop with Claude.”