Meta's Departing Chief AI Scientist Acknowledges Company's Manipulation of Test Results for Llama 4 Release
2026-01-03
Author: Site editor

In April 2025, Meta unveiled its Llama 4 large-scale model in two variants, Scout and Maverick. The models use a Mixture of Experts (MoE) architecture with a total parameter count of up to 400 billion, support multimodal processing, and advertise a context window of up to 10 million tokens.

After release, however, the model's actual performance fell well short of expectations, especially on programming tasks. The Maverick variant scored only 16% on the aider multilingual (polyglot) coding benchmark, far below what was anticipated and behind models with far fewer parameters. Llama 4 also showed notable weaknesses in long-context recall and conversational coherence, leaving a substantial gap between its real-world performance and the claims in the official announcements.

More contentiously, internal staff alleged that during training Meta had mixed benchmark test-set data into the training set to inflate the model's benchmark scores, prompting accusations of "cheating." Meta officially denied the allegations, but the damage to Llama 4's reputation was done, and key team members subsequently departed.
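To make the alleged problem concrete: "test-set contamination" means benchmark questions appear verbatim (or near-verbatim) in the training data, so high scores measure memorization rather than capability. Below is a minimal, illustrative sketch of how such contamination is commonly audited, by flagging training documents that share long word-level n-grams with benchmark prompts. All function names and the toy data are hypothetical; real audits operate on billions of documents with hashed n-gram indexes, and nothing here describes Meta's actual pipeline.

```python
# Hedged sketch: flag training docs that share an 8-word n-gram with
# any benchmark test prompt. Illustrative only; names are invented.

def ngrams(text: str, n: int = 8):
    """Yield word-level n-grams of a text as tuples."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])

def contaminated(train_docs, test_docs, n: int = 8):
    """Return indices of training docs sharing any n-gram with the test set."""
    test_grams = set()
    for doc in test_docs:
        test_grams.update(ngrams(doc, n))
    return [i for i, doc in enumerate(train_docs)
            if any(g in test_grams for g in ngrams(doc, n))]

# Toy example: the second training doc copies a test prompt verbatim.
test = ["write a function that reverses a linked list in place using constant memory"]
train = ["an unrelated news article about model releases and benchmarks today ok",
         "prompt: write a function that reverses a linked list in place using constant memory"]
print(contaminated(train, test))  # → [1]
```

A model trained on the flagged document would reproduce the benchmark answer from memory, which is why overlap checks like this are a standard (if imperfect) part of reporting trustworthy benchmark results.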

The incident not only exposed Llama 4's technical shortcomings but also ignited broad discussion in the open-source AI community about transparency and ethics in model evaluation.