In February of this year, Anthropic released Claude Opus 4.6, which was quickly praised by industry observers as the strongest coding model available, thanks to its reasoning ability and careful handling of complex code. Within weeks of launch, however, complaints spread across social media: many users reported a noticeable drop in performance, saying the model's outputs had become shallower, it rushed to produce results, and it repeatedly stumbled even on simple tasks.
Stella Laurenzo, a Senior AI Director at AMD, analyzed 6,852 session logs and found that Claude Opus 4.6's median thinking length had fallen from 2,200 characters to roughly 600, that the ratio of code reading to code modification had dropped from 6.6:1 to 2:1, and that API retries caused by errors had increased roughly 80-fold. In its official response, Anthropic said the changes were made to improve latency and token efficiency by lowering the default reasoning level from 'high' to 'medium', and denied any deliberate downgrade in intelligence.
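Metrics like these are straightforward to reproduce against one's own session logs. Below is a minimal sketch assuming a hypothetical JSONL log format in which each record carries a `thinking` string plus counts of read operations, edit operations, and retries; the actual schema of the logs Laurenzo analyzed has not been published.

```python
import json
import statistics
from pathlib import Path

# Hypothetical JSONL schema: one session per line, e.g.
# {"thinking": "...", "reads": 12, "edits": 2, "retries": 0}
def analyze_sessions(log_path: str) -> dict:
    thinking_lengths = []
    reads = edits = retries = sessions = 0
    for line in Path(log_path).read_text().splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        thinking_lengths.append(len(record.get("thinking", "")))
        reads += record.get("reads", 0)
        edits += record.get("edits", 0)
        retries += record.get("retries", 0)
        sessions += 1
    return {
        "sessions": sessions,
        "median_thinking_chars": statistics.median(thinking_lengths),
        "read_to_edit_ratio": reads / edits if edits else float("inf"),
        "retries_per_session": retries / sessions if sessions else 0.0,
    }

if __name__ == "__main__":
    print(analyze_sessions("sessions.jsonl"))
```

Comparing these numbers across time windows (before and after a suspected change) is what surfaces a shift like 2,200 characters down to 600.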
User-collected data, however, painted a different picture. In complex engineering scenarios, the model significantly underestimated task complexity, producing shallow reasoning, higher costs for users, and lower overall quality.
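For users affected by the lowered default, one practical workaround is to request extended thinking explicitly rather than relying on the server-side default. The sketch below uses the Anthropic Python SDK's extended-thinking parameter; the model identifier is illustrative, and whether an explicit budget fully restores the previous reasoning depth is an assumption, not something Anthropic has confirmed.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",  # illustrative model ID; check the current models list
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 8000,  # assumption: an explicit budget counters the lowered default
    },
    messages=[
        {
            "role": "user",
            "content": "Refactor this module and explain the trade-offs step by step.",
        }
    ],
)

# Thinking blocks and the final answer arrive as separate content blocks.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print(block.text)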
