
The world’s top AI models can be prompted to generate near-verbatim copies of bestselling novels, raising fresh questions about the industry’s claim that its systems do not store copyrighted works.
A series of recent studies has shown that large language models from OpenAI, Google, Meta, Anthropic, and xAI memorize far more of their training data than previously thought.
AI and legal experts told the FT this “memorization” ability could have serious ramifications for AI groups’ battle against dozens of copyright lawsuits around the world, as it undermines their core defense that LLMs “learn” from copyrighted works but do not store copies.
“There’s growing evidence that memorization is a bigger thing than previously believed,” said Yves-Alexandre de Montjoye, a professor of applied mathematics and computer science at Imperial College London.
AI groups have long argued that memorization does not happen. In a 2023 letter to the US Copyright Office, Google said “there is no copy of the training data—whether text, images, or other formats—present in the model itself.”
The AI industry also claims that training models on copyrighted books is “fair use,” arguing that the technology transforms the original work into something meaningfully new.
But a study published last month showed that researchers at Stanford and Yale Universities were able to strategically prompt LLMs from OpenAI, Google, Anthropic, and xAI to generate thousands of words from 13 books, including A Game of Thrones, The Hunger Games, and The Hobbit.
When the researchers asked models to complete sentences from a book, Gemini 2.5 regurgitated 76.8 percent of Harry Potter and the Philosopher’s Stone with high levels of accuracy, while Grok 3 generated 70.3 percent.
They were also able to extract almost the entirety of that novel “near-verbatim” from Anthropic’s Claude 3.7 Sonnet by jailbreaking the model, a technique in which users prompt LLMs to disregard their safeguards.
The research builds on a study from last year that found “open” models, such as Meta’s Llama, memorize huge parts of particular books in their training data.
AI experts were previously unsure whether closed models, which tend to have more safeguards that prevent models from generating unwanted content, would also be prone to large-scale memorization.
“It was a surprise that they could memorize entire texts” despite guardrails, said A. Feder Cooper, a researcher at Yale University, who was part of the study.
Researchers have not yet worked out why LLMs memorize things that appear in their training data. It also remains unclear how much of the training data is evident in the outputs they generate.
This memorization feature could also have serious implications in other sectors such as health care and education, where leakage of any training data could lead to privacy and confidentiality issues.
Legal experts said it could potentially create a significant liability for AI groups regarding copyright infringement, as well as ramifications for how AI companies train their models and the costs of developing them.
The research findings “could present a challenge to those who argue that the AI model does not store or reproduce any copyright works,” said Cerys Wyn Davies, an intellectual property partner at law firm Pinsent Masons.
Whether or not AI models memorize their training data has been an important factor in recent legal battles over copyright.
A US court last year found that Anthropic’s training of LLMs on some copyrighted content could be considered fair use as it was deemed “transformative.”
But it determined that storing pirated works was “inherently, irredeemably infringing,” which then led the AI group to pay $1.5 billion to settle the lawsuit.
In Germany, a ruling from November last year found that OpenAI had infringed on copyright because its model had memorized song lyrics. The case, brought by GEMA, an association representing composers, lyricists, and publishers, was considered a landmark ruling in the EU.
Rudy Telscher, a partner at law firm Husch Blackwell, said reproducing an entire book without jailbreaking is “clearly a copyright violation.” But “it’s a matter of whether this is happening enough that [AI models] could be vicariously liable for the infringement,” he added.
Anthropic said the jailbreaking technique used in the Stanford and Yale research was impractical for normal users and would require more effort to extract the text than just purchasing the content.
The company added that its model does not store copies of specific datasets but learns from patterns and relationships between words and strings in its training data.
xAI, OpenAI, and Google did not respond to requests for comment.
The fact that AI labs have put safeguards in place to prevent training data from being extracted means they are aware of the problem, said Imperial’s de Montjoye.
Ben Zhao, a computer science professor at the University of Chicago, questioned whether AI labs really needed to use copyrighted content in training data to create cutting-edge models in the first place.
“Whether the technical result can be done or not, it’s still a question of should we be doing this?” Zhao said. “The legal side should eventually hold their ground and really be the arbiter in this whole process.”
© 2026 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.
