Former Cloudflare executive John Graham-Cumming recently announced the launch of a website, lowbackgroundsteel.ai, that treats pre-AI, human-created content like a precious commodity: a time capsule of organic creative expression from before machines joined the conversation. "The idea is to point to sources of text, images and video that were created prior to the explosion of AI-generated content," Graham-Cumming wrote on his blog last week. The reason? To preserve what made non-AI media uniquely human.
The archive's name comes from a scientific phenomenon of the Cold War era. After nuclear weapons testing began in 1945, fallout in the atmosphere contaminated newly produced steel worldwide. For decades, scientists who needed radiation-free metal for sensitive instruments had to salvage steel from pre-war shipwrecks, a material that became known as "low-background steel." Graham-Cumming sees a parallel with today's web, where AI-generated content increasingly mingles with human-created material and contaminates it.
With the advent of generative AI models like ChatGPT and Stable Diffusion in 2022, it has become far more difficult for researchers to verify that media found on the Internet was created by humans without the aid of AI tools. ChatGPT in particular triggered an avalanche of AI-generated text across the web, forcing at least one research project to shut down entirely.
That casualty was wordfreq, a Python library created by researcher Robyn Speer that tracked word frequency across more than 40 languages by analyzing millions of sources, including Wikipedia, movie subtitles, news articles, and social media. The tool was widely used by academics and developers to study how language evolves and to build natural language processing applications. The project announced in September 2024 that it would no longer be updated because "the Web at large is full of slop generated by large language models, written by no one to communicate nothing."
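For readers who never used it, here is a minimal sketch of the kind of query wordfreq answers, using two functions from its documented interface (word_frequency and top_n_list); the exact numbers depend on the frequency data bundled with the installed version.

```python
from wordfreq import word_frequency, top_n_list

# Frequency of a word as a fraction of all words in wordfreq's combined corpora
print(word_frequency("the", "en"))    # very common: roughly 0.05 for English
print(word_frequency("slop", "en"))   # a much rarer word

# The ten most common words in a given language
print(top_n_list("en", 10))
print(top_n_list("es", 10))
```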
Some researchers also worry about AI models training on their own outputs, potentially leading to quality degradation over time, a phenomenon sometimes called "model collapse." But recent evidence suggests this fear may be overblown under certain conditions. Research by Gerstgrasser et al. (2024) suggests that model collapse can be avoided when synthetic data accumulates alongside real data rather than replacing it entirely. In fact, when properly curated and combined with real data, synthetic data from AI models can even help train newer, more capable models.
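The difference between replacing real data and accumulating alongside it can be seen in a toy experiment (my illustration, not the paper's setup): repeatedly fit a simple Gaussian "model" to data and sample new "synthetic" data from it. The sample sizes and generation counts below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=200)  # stand-in for human-written data

def fit_gaussian(data):
    """'Train' a toy model: estimate a mean and standard deviation."""
    return data.mean(), data.std()

def error(mu, sigma):
    """Distance from the true generating distribution (mean 0, std 1)."""
    return abs(mu) + abs(sigma - 1.0)

# Regime 1 (replace): each generation trains only on the previous generation's output.
mu, sigma = fit_gaussian(real)
for _ in range(100):
    synthetic = rng.normal(mu, sigma, size=200)
    mu, sigma = fit_gaussian(synthetic)
print(f"replace real data entirely:     error ~ {error(mu, sigma):.3f}")  # typically drifts upward

# Regime 2 (accumulate): synthetic output joins a pool that still contains the real data.
pool = real.copy()
mu, sigma = fit_gaussian(pool)
for _ in range(100):
    synthetic = rng.normal(mu, sigma, size=200)
    pool = np.concatenate([pool, synthetic])
    mu, sigma = fit_gaussian(pool)
print(f"accumulate alongside real data: error ~ {error(mu, sigma):.3f}")  # typically stays small
```

In the first regime, each generation's estimation error compounds because every trace of the original data is thrown away; in the second, the original data keeps anchoring the pool, which is the intuition behind the accumulation result.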
Graham-Cumming is no stranger to tech preservation efforts. He's a British software engineer and writer best known for creating POPFile, an open source email spam filtering program, and for successfully petitioning the UK government to apologize for its persecution of codebreaker Alan Turing—an apology that Prime Minister Gordon Brown issued in 2009.
As it turns out, his pre-AI website isn't new, but it has languished unannounced until now. "I created it back in March 2023 as a clearinghouse for online resources that hadn't been contaminated with AI-generated content," he wrote on his blog.
The website points to several major archives of pre-AI content, including a Wikipedia dump from August 2022 (before ChatGPT's November 2022 release), Project Gutenberg's collection of public domain books, the Library of Congress photo archive, and GitHub's Arctic Code Vault, a snapshot of open source code captured in February 2020 and stored in a decommissioned coal mine on the Arctic archipelago of Svalbard. The wordfreq project appears on the list as well, flash-frozen from a time before AI contamination made its methodology untenable.
The site accepts submissions of other pre-AI content sources through its Tumblr page. Graham-Cumming emphasizes that the project aims to document human creativity from before the AI era, not to make a statement against AI itself. Once atmospheric nuclear testing ended and background radiation levels declined, low-background steel eventually became unnecessary for most uses. Whether pre-AI content will follow a similar trajectory remains an open question.
Still, it feels reasonable to protect sources of human creativity now, including archival ones, because these repositories may become useful in ways few appreciate at the moment. For example, in 2020, I proposed creating a so-called "cryptographic ark": a timestamped archive of pre-AI media, collected before my then-arbitrary cutoff date of January 1, 2022, that future historians could verify as authentic. AI slop pollutes more than the current discourse; it could cloud the historical record as well.
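As a rough sketch of the timestamping idea (my illustration, not a description of any existing ark): hash each file, record the digest with a date, and anchor that record somewhere tamper-evident so a future reader can confirm the file existed, unchanged, before a given moment. The filename below is hypothetical.

```python
import hashlib
from datetime import datetime, timezone

def sha256_of_file(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical archive file; any pre-AI snapshot would do.
path = "enwiki-20220801-pages-articles.xml.bz2"
record = f"{sha256_of_file(path)}  {datetime.now(timezone.utc).isoformat()}  {path}"
print(record)

# The digest alone proves nothing about age; the record has to live somewhere
# independently dated and tamper-evident (a public transparency log, a printed
# newspaper, a blockchain) for a future historian to trust the timestamp.
```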
For now, lowbackgroundsteel.ai stands as a modest catalog of human expression from what may someday be seen as the last pre-AI era. It's a digital archaeology project marking the boundary between human-generated and hybrid human-AI cultures. In an age where distinguishing between human and machine output grows increasingly difficult, these archives may prove valuable for understanding how human communication evolved before AI entered the chat.