Musk’s Grok 4 launches one day after chatbot generated Hitler praise on X
Source: Ars Technica
xAI claims new multi-agent model hits top benchmarks as Nazi controversy lingers.


Credit: Bloomberg via Getty Images

On Wednesday night, Elon Musk unveiled xAI's latest flagship models Grok 4 and Grok 4 Heavy via livestream, just one day after the company's Grok chatbot began generating outputs that featured blatantly antisemitic tropes in responses to users on X.

Of the two models, xAI calls Grok 4 Heavy its "multi-agent version." According to Musk, Grok 4 Heavy "spawns multiple agents in parallel" that "compare notes and yield an answer," simulating a study group approach. The company describes this as test-time compute scaling (similar to previous simulated reasoning models), claiming that it increases the computational resources used at runtime (a phase called "inference") by roughly an order of magnitude.
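
xAI has not published how Grok 4 Heavy's agents actually "compare notes," so the following is only a toy sketch of the general idea behind parallel test-time compute, not the company's method. It assumes a hypothetical `toy_agent` function standing in for a single model call, and it models the note-comparing step as a plain majority vote, which is our assumption. The agent is deliberately rigged so that a lone run returns a wrong answer while a group of ten converges on the right one, at roughly ten times the compute.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def toy_agent(question: str, seed: int) -> str:
    """Hypothetical stand-in for one model call (not a real xAI API).

    Deterministic on purpose: every third seed produces a distinct wrong
    answer, and all other seeds produce the correct answer "42".
    """
    if seed % 3 == 0:
        return str(40 + seed)  # a wrong answer that differs per agent
    return "42"


def answer_with_agents(question: str, n_agents: int) -> str:
    """Run n_agents toy agents in parallel and return the most common answer."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(lambda s: toy_agent(question, s), range(n_agents)))
    # Assumption: "comparing notes" is reduced to a simple majority vote.
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    question = "What is 6 * 7?"
    print("1 agent:  ", answer_with_agents(question, 1))
    print("10 agents:", answer_with_agents(question, 10))
```

The trade-off the sketch illustrates is the one xAI describes: answer quality is bought with more computation at inference time rather than with a larger or retrained model. Real systems would likely use something richer than a vote to reconcile the agents' outputs.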

During the livestream, Musk claimed the new models achieved frontier-level performance on several benchmarks. On Humanity's Last Exam, a deliberately challenging test with 2,500 expert-curated questions across multiple subjects, Grok 4 reportedly scored 25.4 percent without external tools, which the company says outperformed OpenAI's o3 at 21 percent and Google's Gemini 2.5 Pro at 21.6 percent. With tools enabled, xAI claims Grok 4 Heavy reached 44.4 percent. However, it remains to be seen if these AI benchmarks actually measure properties that translate to usefulness for users.

The release timing proved particularly noteworthy given the events of the preceding 48 hours on Musk's X social media platform, which included multiple instances of the chatbot labeling itself as "MechaHitler." The antisemitic posts emerged after an update over the weekend that instructed the chatbot to "not shy away from making claims which are politically incorrect, as long as they are well substantiated." xAI reportedly removed the modified directive Tuesday.

In response to the episode, Poland announced plans to report xAI to the European Commission, and Turkey blocked some access to Grok following the incident. On Wednesday, Musk wrote in a post on X that "Grok was too compliant to user prompts. Too eager to please and be manipulated, essentially. That is being addressed."

Adding to the week's turmoil, X CEO Linda Yaccarino announced Wednesday morning she was stepping down, writing on X, "Now, the best is yet to come as X enters a new chapter with @xai." Her departure follows Musk's March announcement that his artificial intelligence company, xAI, acquired X in an all-stock transaction that valued X at $33 billion and gave xAI a valuation of $80 billion.

The Grok technical conundrum

Since the launch of Grok 1 in 2023, the Grok series of large language models has been something of a conundrum for some members of the AI technical community. Judging by posts on X, some prominent researchers like Andrej Karpathy have historically taken the underlying models seriously as examples of technical achievement in AI development.

But that achievement has been inextricably linked to Musk, who has seemingly guided the application of his AI models (in the form of "Grok" chatbot assistants on X and in the Grok app) through a series of controversies over the past few years. Those controversies include potentially using OpenAI models to generate training data, producing uncensored image outputs, making up fake news based on X user jokes, and allowing explicit abusive voice chats in its app, among others.

Musk has also apparently used the Grok chatbots as an automated extension of his trolling habits, showing examples of Grok 3 producing "based" opinions that criticized the media in February. In May, Grok on X began repeatedly generating outputs about white genocide in South Africa, and most recently, we've seen the Grok Nazi output debacle. It's admittedly difficult to take Grok seriously as a technical product when it's linked to so many examples of unserious and capricious applications of the technology.

Still, the technical achievements xAI claims for various Grok 4 models seem to stand out. The Arc Prize organization reported that Grok 4 Thinking (with simulated reasoning enabled) achieved a score of 15.9 percent on its ARC-AGI-2 test, which the organization says nearly doubles the previous commercial best and tops the current Kaggle competition leader.

"With respect to academic questions, Grok 4 is better than PhD level in every subject, no exceptions," Musk claimed during the livestream. We've previously covered nebulous claims about "PhD-level" AI, finding them to be generally specious marketing talk.

Premium pricing amid controversy

During Wednesday's livestream, xAI also announced plans for an AI coding model in August, a multi-modal agent in September, and a video generation model in October. The company also plans to make Grok 4 available in Tesla vehicles next week, further expanding Musk's AI assistant across his various companies.

Despite the recent turmoil, xAI has moved forward with an aggressive pricing strategy for "premium" versions of Grok. Alongside Grok 4 and Grok 4 Heavy, xAI launched "SuperGrok Heavy," a $300-per-month subscription that makes it the most expensive AI service among major providers. Subscribers will get early access to Grok 4 Heavy and upcoming features.

Whether users will pay xAI's premium pricing remains to be seen, particularly given the AI assistant's tendency to periodically generate politically motivated outputs. These incidents, which stem from deliberate choices about training and system prompts, represent fundamental management and implementation issues that, so far, no fancy-looking test-taking benchmarks have been able to capture.