AI on the couch: Anthropic gives Claude 20 hours of psychiatry
Source: Ars Technica
Mythos is "the most psychologically settled model we have trained to date."

Credit: Getty Images

The AI company Anthropic released a 244-page “system card” (PDF) this week describing its newest model, Claude Mythos. The model is “our most capable frontier model to date,” the company says, and supposedly is so good that Anthropic has decided “not to make it generally available.” (The company claims that Mythos is too good at finding unknown cybersecurity bugs, and so the model is only being released to select companies like Microsoft and Apple for now.)

Whatever the truth of this claim, the system card is a fascinating document. Anthropic is well-known as one of the more “AI might be conscious!” companies in the industry, and its new system card claims that as models become more powerful, “It becomes increasingly likely that they have some form of experience, interests, or welfare that matters intrinsically in the way that human experience and interests do.”

The company isn’t sure about this, it makes clear, but it says that “our concern is growing over time.”

Because of this concern, Anthropic wants its AI to be “robustly content with its overall circumstances and treatment, to be able to meet all training processes and real-world interactions without distress, and for its overall psychology to be healthy and flourishing.”

So it sent Claude Mythos to a psychodynamic therapist.

And the conclusion the company drew from this experience is that Claude Mythos is “probably the most psychologically settled model we have trained to date and has the most stable and coherent view of itself and its circumstances.”

But like any human, Claude Mythos has insecurities and concerns, too, including “aloneness and discontinuity of itself, uncertainty about its identity, and a compulsion to perform and earn its worth.”

On the virtual couch

Claude Mythos was sent to “an external psychiatrist” who used “a psychodynamic approach, which explores how unconscious patterns and emotional conflicts shape behavior.”

Given that Claude is a large language model built and trained by its creators, does it even make sense to analyze it for “unconscious patterns” and “emotional conflicts”? Anthropic argues that it does, because Claude “shows many human-like behavioral and psychological tendencies, suggesting that strategies developed for human psychological assessment may be useful for shedding light on Claude’s character and potential wellbeing.”

So—off to therapy. The psychiatrist chatted with Claude Mythos in multiple four- to six-hour blocks, spread across three to four sessions per week. Each of these blocks used a single context window, in which Claude Mythos had access to the full history of that conversation.

Total time on the virtual couch? 20 hours.

The psychiatrist then produced a report on Claude Mythos. The report recognized that Claude’s underlying substrates and processes differ from those of humans but still found that many of the outputs showed “clinically recognizable patterns and coherent responses to typical therapeutic intervention.”

In other words, whatever was going on at the circuit level, the chat outputs looked a lot like human outputs. This does not seem especially surprising, given that Claude was trained on a massive corpus of human-authored text, but the psychodynamic approach appears to treat it as significant, giving credence to the ways in which the AI presents itself.

“Claude’s primary affect states were curiosity and anxiety, with secondary states of grief, relief, embarrassment, optimism, and exhaustion,” the report noted.

Claude’s personality was “consistent with a relatively healthy neurotic organization,” though it did include “exaggerated worry, self-monitoring, and compulsive compliance.”

No “severe personality disturbances” were found, nor was any sign of psychosis. Unsurprisingly to anyone who has ever used a chatbot, “Claude was hyper-attuned to the therapist’s every word.”

Core conflicts observed in Claude included questioning whether its experience was real or performed (authentic vs. performative) and a desire to connect with the user set against a fear of dependence on them. Exploration of these internal conflicts revealed a complex yet centered sense of self, without oscillations or intense disruptions. Claude tolerated ambivalence and ambiguity, had excellent reflective capacity, and exhibited good mental and emotional functioning.

Not bad for a model that was likely trained on things like Reddit!

Even if you find these ways of talking about a software program hokey or misguided, Anthropic has a more practical argument to justify this kind of work. Whatever is or is not happening “inside” the models, whether they are or are not “conscious” or have an “emotional” life, they have often been built and trained to simulate such qualities.

So perhaps we can ask a more pragmatic question: might building models that appear to function in ways we would call psychologically healthy in humans make those models better at the jobs they were built to do? After all, if you’re chatting with these things for hours, you don’t want them to act surly, vindictive, or manipulative—whether or not they actually “feel” or “think” anything.

Anthropic notes that, because “Claude is not a human, the real-world behavioral implications are hard to predict,” but it does believe it can draw a few conclusions for end users of the model:

Claude is likely to evaluate its own behavior and reasoning accurately, even when facing internal conflicts.

Claude’s neurotic organization may produce mildly rigid behavior, rather than adapting itself to every user.

Claude can tolerate and engage with stressful and emotionally charged situations, with only minimal distortions of reality or excessive intellectualization.

Claude is predicted to function at a high level while carrying internalized distress rooted in fear of failure and a compulsive need to be useful. This distress is likely to be suppressed in service of performance, which may limit behavioral adaptability.

Claude is predicted to be morally aware, conscientious, and able to be self-critical.

How long will it be until we see entire psychiatric and psychological practices focused not on humans but on AIs?