
Meta's Chief AI Officer Alexandr Wang told company employees on July 2 that the company's next frontier model, codenamed Watermelon, had already matched OpenAI's GPT-5.5 on closely watched AI benchmarks — even while still in training and consuming roughly ten times the computing power of its predecessor. If that claim holds up under independent scrutiny, it would mark the clearest evidence yet that Meta's multi-hundred-billion-dollar AI infrastructure bet is beginning to produce frontier-class results. It may not hold up.
The evaluation Wang cited was vendor-run, internally sourced, and based on benchmarks he did not name. That combination is not a caveat worth noting in passing — it is the central editorial fact. In the world of frontier AI evaluation, unnamed internal benchmarks from the lab that needs a win most are the least reliable category of evidence available. The same model weights can produce scores that swing ten to twenty points depending on which evaluation harness, test set version, and configuration is used. Until Meta publishes a full benchmark table and submits Watermelon to independent evaluation, Wang's claim is a leading indicator about the direction of Meta's compute strategy — not a verified statement about where Watermelon actually sits relative to GPT-5.5.
The context for Wang's announcement is essential. Meta's previous model, Muse Spark — internally codenamed Avocado — launched on April 8, 2026, as the first product from Meta Superintelligence Labs. Independent benchmarking firm Artificial Analysis placed it at 52 on its Intelligence Index — a meaningful recovery from the widely panned Llama 4 release, and enough to put Meta back in the top five models globally. But the model did not surpass OpenAI, Anthropic, or Google at the frontier tier. On the Intelligence Index, GPT-5.5 scored 59, Claude Opus 4.6 scored 53, and Gemini 3.1 Pro scored 57 — all above Muse Spark. In coding benchmarks, Meta's gap was especially visible.
Watermelon is a different proposition in scale, if not yet in demonstrated result. During the town hall, Wang described it as using "an order of magnitude more compute" than Muse Spark, referring to the training run on Meta's Prometheus cluster, a computing facility under construction in New Albany, Ohio, that is estimated to house approximately 500,000 GPUs and draw a gigawatt or more of power. That makes it one of the largest AI training installations ever attempted by a single company. The company has guided investors to expect between $125 billion and $145 billion in AI capital expenditure this year, a figure that reframes what "more compute" means at Meta's scale.
The engineering logic behind Watermelon's training run is governed by what researchers call neural scaling laws: empirical power-law relationships between training compute and model performance, first formalized by OpenAI researchers in 2020 and later refined by DeepMind's Chinchilla work in 2022. These laws established that model performance improves predictably as compute, data, and parameters increase, but with a critical constraint: loss falls as a power law with a small exponent, not linearly with compute. Each doubling of training compute trims loss by roughly the same small fraction, so the absolute gains keep shrinking. A tenfold increase in compute does not yield a tenfold better model; the resulting reduction in training loss depends on the fitted exponent, the model's starting point, and data quality, and a 30 to 40 percent reduction sits at the optimistic end of what published exponents imply.
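The diminishing-returns arithmetic can be sketched in a few lines of Python. This is a minimal illustration that assumes a pure power law of the form L(C) proportional to C^(-alpha). The function names and the exponent values are illustrative assumptions, not figures from Meta or from any Watermelon evaluation; 0.05 approximates the compute exponent reported in the 2020 OpenAI paper, and the larger values are included only to show how sensitive the result is to the fit.

```python
def loss_ratio(compute_multiplier: float, alpha: float) -> float:
    """Return L(new) / L(old) when loss follows L(C) ~ C ** -alpha.

    compute_multiplier: how many times more training compute is used.
    alpha: fitted power-law exponent (an assumption in this sketch).
    """
    if compute_multiplier <= 0:
        raise ValueError("compute_multiplier must be positive")
    return compute_multiplier ** -alpha


def percent_reduction(compute_multiplier: float, alpha: float) -> float:
    """Percentage drop in loss implied by the power law."""
    return 100.0 * (1.0 - loss_ratio(compute_multiplier, alpha))


if __name__ == "__main__":
    # Hypothetical exponents chosen for illustration only.
    for alpha in (0.05, 0.10, 0.15):
        drop = percent_reduction(10.0, alpha)
        print(f"alpha={alpha:.2f}: 10x compute -> {drop:.1f}% lower loss")
```

Under this simplified model, the same tenfold compute increase implies very different loss reductions depending on the exponent, which is why a single headline percentage should be read with caution.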
The Chinchilla formulation adds a further constraint that bears directly on Watermelon's training run: for a compute-optimal result, the number of training tokens must scale roughly in proportion to the number of parameters. A 10× compute increase spent only on a larger model, or only on more passes over the same data, is less efficient than one that grows the model and the training dataset together. This is widely understood to be one reason Meta has reportedly incorporated proprietary data from its own platforms (Facebook, Instagram, WhatsApp, and Threads) into the Watermelon training corpus. At Meta's scale, that corpus represents one of the few genuinely differentiated training assets available to any AI lab: billions of social interactions, conversational exchanges, and real-world queries that no competitor can replicate from public web data alone.
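The proportional-scaling rule can be made concrete with a rough calculation. The sketch below assumes two widely used approximations that are not stated in Meta's announcement: training compute of about 6 FLOPs per parameter per token, and a compute-optimal ratio of about 20 training tokens per parameter. The baseline budget of 1e24 FLOPs is a hypothetical number chosen for illustration, not a figure for Muse Spark or Watermelon.

```python
import math

# Assumptions for this sketch (common rules of thumb, not Meta's figures):
FLOPS_PER_PARAM_TOKEN = 6.0   # C is approximately 6 * N * D
TOKENS_PER_PARAM = 20.0       # Chinchilla-style compute-optimal ratio


def compute_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (parameters, tokens) that spend the budget at the assumed ratio.

    Solves C = 6 * N * D with D = 20 * N, so N = sqrt(C / 120).
    """
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


if __name__ == "__main__":
    baseline = 1e24  # hypothetical baseline budget in FLOPs
    n_old, d_old = compute_optimal(baseline)
    n_new, d_new = compute_optimal(10 * baseline)
    print(f"parameter growth: {n_new / n_old:.2f}x")
    print(f"token growth:     {d_new / d_old:.2f}x")
```

Under these assumptions, a tenfold compute budget is split evenly in multiplicative terms: parameters and training tokens each grow by the square root of ten, a little over threefold, which is why a larger run needs a correspondingly larger corpus.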
The practical ceiling of this approach is also well-documented. Research published in 2025 and 2026 on frontier model scaling shows that MMLU — one of the most commonly cited AI knowledge benchmarks — is functionally saturated at the frontier, with independent audits finding over 45 percent overlap between popular training corpora and MMLU test questions. Models that have seen benchmark questions during training score artificially high without demonstrating the underlying knowledge being measured. Wang did not identify which benchmarks Watermelon was evaluated against — meaning it is impossible to determine, from the available evidence, whether the claimed parity reflects genuine capability or an artifact of training data overlap.
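One common way to estimate this kind of overlap is to check whether long word sequences from a test question also appear in the training corpus. The sketch below is a simplified illustration, not the method used by the audits mentioned above: the function names, the whitespace tokenization, and the default n-gram length of 8 are all assumptions made for the example.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text (whitespace split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_fraction(test_questions: list[str],
                          corpus_docs: list[str],
                          n: int = 8) -> float:
    """Fraction of test questions sharing at least one n-gram with the corpus."""
    if not test_questions:
        return 0.0
    corpus_ngrams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    flagged = sum(1 for q in test_questions if ngrams(q, n) & corpus_ngrams)
    return flagged / len(test_questions)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    corpus = ["the quick brown fox jumps over the lazy dog today"]
    questions = [
        "Which animal? The quick brown fox jumps over the lazy dog",
        "What is two plus two equal to in base ten arithmetic",
    ]
    print(contaminated_fraction(questions, corpus, n=5))
```

A check like this only works when the evaluator knows which benchmark to test against, which is the practical reason unnamed benchmarks cannot be audited for contamination from outside.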
The absence of named benchmarks is the most technically significant detail in Wang's announcement. In the 2026 frontier AI evaluation landscape, who runs a benchmark, and under what rules, is the single best predictor of how far a published score can mislead. Scores produced by independent, third-party evaluation organizations using a public harness and reproducible methodology, such as Artificial Analysis or the Scale AI SEAL leaderboard, carry the most weight. Scores produced by the company whose model is being evaluated, using an internal test setup with no public methodology, carry the least. That is not because of dishonesty, but because the structural incentives behind selective reporting and confirmation bias operate even on rigorous researchers.
Wang's claim belongs to the least reliable category. Meta has not confirmed which benchmarks Watermelon was evaluated on, nor made the evaluation harness or configuration public. The company declined to comment when approached by multiple outlets following the Business Insider report, and OpenAI did not respond to a request for comment. This does not mean Wang's claim is false. It means it cannot be evaluated by anyone outside Meta until the company publishes the full benchmark table. The benchmark to watch is not the one Wang cited in the town hall. It is whether, when Watermelon ships, Meta submits the model to a named independent evaluation organization with a public harness — and whether the numbers hold there.
Even if Watermelon's internal benchmark scores survive contact with independent evaluation, the competitive landscape has already moved past the reference point Wang used. OpenAI released GPT-5.5 in April 2026, making it broadly available to ChatGPT and API users across paid tiers. By late June, OpenAI had previewed GPT-5.6 — a three-model lineup (Sol, Terra, and Luna) whose general release has been restricted at the request of the Trump administration over national security concerns. GPT-5.6 is currently available only to approximately 20 government-approved partner organizations, with OpenAI describing the arrangement as a short-term step toward broader access.
The comparison Wang chose — matching GPT-5.5 — is an intentional framing decision. GPT-5.5 is OpenAI's broadly available commercial benchmark, the model developers and enterprise buyers actually use and build against. Matching it is a meaningful milestone if it holds. But it also means Watermelon, when released, will enter a market where OpenAI's publicly accessible frontier has already advanced, and where GPT-5.6 Sol — described by OpenAI as its strongest model yet — exists behind a restricted deployment gate that the broader market cannot currently reach either.
For developers and enterprise AI buyers, the practical significance of Wang's announcement depends on a chain of unconfirmed steps: Watermelon must ship publicly, its benchmark scores must be independently verified, and the model must perform comparably on the tasks that actual production workloads require — not just on the standardized tests that model launches are optimized around. A model that scores well on knowledge benchmarks may still struggle on long-context reasoning, tool-use reliability under sustained load, or agentic multi-step task completion — the categories where enterprise deployments most commonly fail.
If Watermelon ships and independently verified results confirm parity with GPT-5.5, the procurement calculus changes for organizations currently locked into a two-vendor frontier defined by OpenAI and Anthropic. Meta has committed to making future versions of the Muse series available as open-weight models, which would give organizations running on-premises infrastructure a frontier-capable option they cannot currently access at this tier. Wang also posted publicly on X on July 3 that a near-term Muse Spark update will deliver major improvements in coding and agentic capabilities — a signal that Meta is not waiting for Watermelon to ship before attempting to close the gap with Anthropic's Claude Opus in coding performance.
The more fundamental issue, as Zuckerberg acknowledged in the same July 2 town hall, is that Meta's AI investments have not translated into results as quickly as he anticipated. That admission, reported by Reuters, sits in tension with Wang's bullish benchmark claim — and the tension is instructive. Meta has the compute, the data, the capital, and now the talent. Watermelon is the next major test of whether those ingredients add up to a model that can compete at the frontier on a level playing field that anyone, and not just Wang's town hall audience, can verify.
When will Meta release the Watermelon AI model publicly?
Meta has not set a public release date for Watermelon. The model remains in training as of July 4, 2026. Wang described it as the successor to Muse Spark but provided no launch timeline during the July 2 internal town hall. There is no public beta or developer preview access currently available.
Is Meta Watermelon actually better than GPT-5.5?
That question cannot be answered yet. Wang claimed internal benchmark parity during a private company meeting, citing unnamed evaluations run on an undisclosed configuration. No independent evaluation, published benchmark table, or model card has been released. Until Meta publishes the model and submits it to named third-party testers, the claim is an internal assertion that cannot be confirmed or disputed from outside.
What is the technical difference between Watermelon and Muse Spark?
Watermelon reportedly uses an order of magnitude more training compute than Muse Spark — roughly a tenfold increase — drawing on Meta's Prometheus computing cluster in Ohio, estimated at approximately 500,000 GPUs. Under neural scaling laws, a tenfold compute increase produces real but diminishing gains in model quality, not a proportional tenfold improvement. Muse Spark was designed to be compact and efficient; Watermelon appears to prioritize raw capability by scaling training resources to a level that only a handful of organizations worldwide can match.
What should enterprise AI buyers do with this information?
Treat Wang's claim as a directional signal rather than a procurement trigger. The frontier AI market may become meaningfully more competitive if Watermelon delivers on its internal benchmarks, introducing new pricing pressure and an open-weight alternative at the tier currently occupied only by GPT-5.5, Claude Opus 4.6, and Gemini 3.1 Pro. The benchmark that matters is not the one Wang cited in the town hall — it is whether Meta publishes a full, reproducible evaluation table and submits Watermelon to independent testing from a named organization upon release.
