Authors Paul Tremblay and Mona Awad filed a class-action complaint in California federal court alleging OpenAI broke copyright law by training its software to "ingest" their books without permission.
ChatGPT, a large language model, is "trained" by copying massive amounts of text and extracting expressive information from it to form a compilation of input material known as the "training dataset," according to the complaint filed in U.S. District Court in San Francisco.
The lawsuit says neither Tremblay nor Awad, both writers who live in Massachusetts, consented to the use of their copyrighted books as training material for ChatGPT. Nonetheless, "their copyrighted materials were ingested and used to train ChatGPT."
Tremblay owns registered copyrights in several books, including "The Cabin at the End of the World." Awad owns registered copyrights in several books, including "13 Ways of Looking at a Fat Girl" and "Bunny."
"Indeed, when ChatGPT is prompted, ChatGPT generates summaries of Plaintiffs’ copyrighted works — something only possible if ChatGPT was trained on Plaintiffs’ copyrighted works," the 17-page complaint says. "Defendants, by and through the use of ChatGPT, benefit commercial and profit richly from the use of Plaintiffs’ and Class members’ copyrighted materials."
The complaint cites a June 2018 paper in which OpenAI revealed it trained its GPT-1 tool on BookCorpus, a collection of "over 7,000 unique unpublished books from a variety of genres, including Adventure, Fantasy, and Romance."
"OpenAI confirmed why a dataset of books was so valuable: ‘Crucially, it contains long stretches of contiguous text, which allows the generative model to learn to condition on long-range information.’ Hundreds of large language models have been trained on BookCorpus, including those made by OpenAI, Google, Amazon, and others," the complaint notes.
Andres Guadamuz, a reader in intellectual property law at the University of Sussex, told The Guardian the complaint is the first filed against OpenAI over copyright law.
Joseph Saveri and Matthew Butterick, attorneys representing the authors, told the newspaper that books are ideal for training large language models because they contain "high-quality, well-edited, long-form prose," essentially forming "the gold standard of idea storage for our species."
"Defendants breached their duties by negligently, carelessly, and recklessly collecting, maintaining and controlling Plaintiffs’ and Class members’ Infringed Works and engineering, designing, maintaining and controlling systems — including ChatGPT — which are trained on Plaintiffs’ and Class members’ Infringed Works without their authorization," the complaint says.
The lawsuit seeks an award of statutory and other damages.
Fox News Digital reached out to OpenAI for comment Wednesday but did not immediately hear back.