
OpenAI appears to have lost its fight to keep news organizations from digging through 20 million ChatGPT logs for evidence of copyright infringement. Now the company also faces calls for sanctions and demands to retrieve and share potentially millions of deleted chats long thought untouchable in the litigation.
On Monday, US District Judge Sidney Stein denied OpenAI's objections, which claimed that Magistrate Judge Ona Wang failed to adequately balance the privacy interests of ChatGPT users not involved in the litigation when she ordered OpenAI to produce the 20 million logs.
Instead, OpenAI wanted Stein to agree that it would be much less burdensome to users if OpenAI ran search terms to find potentially infringing outputs in the sample. That way, news plaintiffs would get access only to chats relevant to their case, OpenAI suggested.
But Stein found that Wang appropriately weighed ChatGPT users’ privacy interests when ordering OpenAI to produce the logs. For example, to shield ChatGPT users, the total number of logs shared was substantially reduced from tens of billions to 20 million, he wrote, and OpenAI has stripped all identifying information from any chats that will be shared.
Stein further agreed that news plaintiffs needed access to the entire sample because, as Wang wrote, even “output logs that do not contain reproductions of News Plaintiffs’ works may still be relevant to OpenAI’s fair use defense.”
Although OpenAI argued that Wang should have approved the path "least burdensome" to users' privacy, the AI company cited no case law to support that argument, Stein wrote, nor its claim that Wang owed it any explanation for rejecting that path.
“Judge Wang’s failure to explain explicitly why she rejected OpenAI’s search term proposal is not clearly erroneous or contrary to law given that she adequately explained her reasons for ordering production of the entirety of the 20 million de-identified log sample,” Stein wrote, affirming Wang’s order.
OpenAI is currently reviewing whether any avenues are left to fight the order, but this looks like the end of the road for the AI firm, which had vowed to do everything in its power to avoid sharing ordinary users' conversations.
Asked for comment, OpenAI pointed Ars to a blog documenting its fight, last updated in mid-December. That blog confirmed that all data that will be shared has “undergone a de-identification process intended to remove or mask PII and other private information.” News plaintiffs will be able to search the data but will be unable to copy or print any data not directly relevant to the case, OpenAI said.
News groups, spearheaded by The New York Times, believe that output logs will show evidence of infringing chatbot responses, as well as responses that dilute news organizations' trademarks or strip copyright management information (CMI) to obscure the source and facilitate unlicensed outputs of their content.
They appear beyond frustrated by what their court filings described as delay tactics from OpenAI and co-defendant Microsoft, which has agreed to share 8.1 million Copilot logs but won’t say exactly when those logs will be shared.
Late last year, news organizations asked the court to consider whether sanctions on OpenAI might be warranted.
Allegedly, it took 11 months for news groups to learn that "OpenAI was destroying relevant output log data" by failing to suspend deletion practices as soon as litigation started, including a "quite substantial" fraction of ChatGPT Free, Pro, and Plus output log data. This data, which was allegedly deleted at a "disproportionately higher rate," is most likely where infringing materials would be found, news groups claimed, since users prompting ChatGPT to skirt paywalls would be the most likely to set chats to delete.
OpenAI provided “no explanation for why it was destroying roughly 1/3 of all user conversation data in the month after [The New York Times] filed suit other than the irrelevant non-sequitur that the ‘number of ChatGPT conversations was uncharacteristically low (shortly before New Year’s Day 2024),’” the filing said.
Describing OpenAI’s alleged “playbook” to dodge copyright claims, news groups accused OpenAI of failing to “take any steps to suspend its routine destruction practices.” There were also “two spikes in mass deletion” that OpenAI attributed to “technical issues.”
However, OpenAI made sure to retain outputs that could help its defense, the court filing alleged, including data from accounts cited in news organizations’ complaints.
OpenAI did not take the same care to preserve chats that could be used as evidence against it, news groups alleged, citing testimony from Mike Trinh, OpenAI’s associate general counsel. “In other words, OpenAI preserved evidence of the News Plaintiffs eliciting their own works from OpenAI’s products but deleted evidence of third-party users doing so,” the filing said.
It’s unclear how much data was deleted, plaintiffs alleged, since OpenAI won’t share “the most basic information” on its deletion practices. But it’s allegedly very clear that OpenAI could have done more to preserve the data, since Microsoft apparently had no trouble doing so with Copilot, the filing said.
News plaintiffs are hoping the court will agree that OpenAI and Microsoft aren't fighting fair by delaying the production of logs, which they said prevents them from building their strongest case.
They’ve asked the court to order Microsoft to “immediately” produce Copilot logs “in a readily searchable remotely-accessible format,” proposing a deadline of January 9 or “within a day of the Court ruling on this motion.”
Microsoft declined Ars’ request for comment.
As for OpenAI, news plaintiffs want to know whether the deleted logs, including the "mass deletions," can be retrieved, perhaps bringing millions more ChatGPT conversations into the litigation that users likely expected would never see the light of day.
On top of possible sanctions, news plaintiffs asked the court to keep in place a preservation order blocking OpenAI from permanently deleting users’ temporary and deleted chats. They also want the court to order OpenAI to explain “the full scope of destroyed output log data for all of its products at issue” in the litigation and whether those deleted chats can be restored, so that news plaintiffs can examine them as evidence, too.
