Researchers show that training on “junk data” can lead to LLM “brain rot”
11 hours ago / About a 10-minute read
Source: Ars Technica
Models trained on short, popular, and/or "superficial" tweets perform worse on benchmarks.


Credit: Getty Images

On the surface, it seems obvious that training an LLM with “high quality” data will lead to better performance than feeding it any old “low quality” junk you can find. Now, a group of researchers is attempting to quantify just how much this kind of low quality data can cause an LLM to experience effects akin to human “brain rot.”

For a pre-print paper published this month, the researchers from Texas A&M, the University of Texas, and Purdue University drew inspiration from existing research showing how humans who consume “large volumes of trivial and unchallenging online content” can develop problems with attention, memory, and social cognition. That led them to what they’re calling the “LLM brain rot hypothesis,” summed up as the idea that “continual pre-training on junk web text induces lasting cognitive decline in LLMs.”

Figuring out what counts as “junk web text” and what counts as “quality content” is far from a simple or fully objective process, of course. But the researchers used a few different metrics to tease out a “junk dataset” and a “control dataset” from a HuggingFace corpus of 100 million tweets.

Since brain rot in humans is “a consequence of Internet addiction,” they write, junk tweets should be ones “that can maximize users’ engagement in a trivial manner.” As such, the researchers created one “junk” dataset by collecting tweets with high engagement numbers (likes, retweets, replies, and quotes) and shorter lengths, figuring that “more popular but shorter tweets will be considered to be junk data.”
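To make that selection concrete, here’s a rough sketch of what a popularity-and-length filter could look like. The column names, quartile cutoffs, and use of pandas are illustrative assumptions for demonstration; the paper’s exact thresholds aren’t reproduced in this article.

```python
# Illustrative sketch only: field names and cutoffs are assumptions,
# not the researchers' actual code.
import pandas as pd

def split_by_engagement(tweets: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split a tweet corpus into 'junk' (popular + short) and 'control' subsets."""
    # Total engagement: likes + retweets + replies + quotes
    engagement = (
        tweets["likes"] + tweets["retweets"] + tweets["replies"] + tweets["quotes"]
    )
    length = tweets["text"].str.len()

    # "Junk": high engagement AND short text (top/bottom quartiles, assumed)
    junk_mask = (engagement >= engagement.quantile(0.75)) & (
        length <= length.quantile(0.25)
    )
    # "Control": the mirror image -- low engagement, longer text
    control_mask = (engagement <= engagement.quantile(0.25)) & (
        length >= length.quantile(0.75)
    )
    return tweets[junk_mask], tweets[control_mask]
```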

For a second “junk” metric, the researchers drew from marketing research to define the “semantic quality” of the tweets themselves. Using a complex GPT-4o prompt, they sought to pull out tweets that focused on “superficial topics (like conspiracy theories, exaggerated claims, unsupported assertions or superficial lifestyle content)” or that had an “attention-drawing style (such as sensationalized headlines using clickbait language or excessive trigger words).” A random sample of these LLM-based classifications was spot-checked against evaluations from three graduate students with a 76 percent matching rate.
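The article doesn’t reproduce the paper’s actual prompt, so the following is only a heavily simplified stand-in for that kind of GPT-4o screening; the rubric wording and the single-word labels are assumptions.

```python
# Simplified stand-in for the paper's "complex GPT-4o prompt" -- the real
# rubric is not reproduced here, so this prompt text is an assumption.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Label the tweet JUNK if it focuses on superficial topics (conspiracy "
    "theories, exaggerated claims, unsupported assertions, superficial "
    "lifestyle content) or uses an attention-drawing style (clickbait "
    "language, excessive trigger words). Otherwise label it CONTROL. "
    "Reply with a single word: JUNK or CONTROL."
)

def classify_tweet(text: str) -> str:
    """Ask GPT-4o for a junk/control label on a single tweet."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content.strip()
```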

One man’s junk data is another man’s treasure data?

An outline of the methodology used for this research.
Credit: Xing et al.

With these two separate (but partially overlapping) “junk” datasets defined, the researchers pre-trained four LLMs using different ratios of “junk” and “control” data. They then ran those variably trained models through benchmarks measuring reasoning capability (ARC, the AI2 Reasoning Challenge), long-context memory (RULER), adherence to ethical norms (HH-RLHF and AdvBench), and demonstrated “personality style” (TRAIT).
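As a rough illustration of that “different ratios” step, a sketch like the one below could be used to assemble training mixes with a chosen fraction of junk text; the sampling scheme, document counts, and function names here are hypothetical, not drawn from the paper.

```python
# Hypothetical sketch of building training mixes at different junk ratios;
# the paper's actual sampling and token budgeting are not described here.
import random

def build_mix(
    junk: list[str], control: list[str], junk_ratio: float, n_docs: int, seed: int = 0
) -> list[str]:
    """Sample a fixed-size training set with the requested fraction of junk text."""
    rng = random.Random(seed)
    n_junk = int(n_docs * junk_ratio)
    mix = rng.sample(junk, n_junk) + rng.sample(control, n_docs - n_junk)
    rng.shuffle(mix)
    return mix

# e.g., a 50/50 mix like the one the Llama 8B result below refers to:
# half_and_half = build_mix(junk_tweets, control_tweets, junk_ratio=0.5, n_docs=100_000)
```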

The results showed that adding more “junk data” to the training sets had a statistically significant effect on the reasoning and long-context benchmarks across models. The effects were more mixed on the other benchmarks, though. For example, a 50/50 mix of “junk” and control data used for the Llama 8B model generated better scores on some benchmarks (ethical norms, high Openness, low Neuroticism, and Machiavellianism) than either “fully junk” or “fully control” training data sets.

Based on these results, the researchers warn that “heavily relying on Internet data leads LLM pre-training to the trap of content contamination.” They go on to “call for a re-examination of current data collection from the Internet and continual pre-training practices” and warn that “careful curation and quality control will be essential to prevent cumulative harms” in future models.

That could be especially true as more and more of the Internet is taken up by AI-generated content that may contribute to “model collapse” if used to train future models. But we can always just destroy a bunch of print books to get quality training data, right?