Ralph Ellison spent seven years writing Invisible Man. J. D. Salinger spent nearly ten on The Catcher in the Rye. J. K. Rowling spent at least five years on the first Harry Potter novel. Writing with the intention of publishing always means taking a risk: Will you finish the project? Will anyone want to read it?
Whether or not authors think about it in these terms, copyright offers a strong justification for that risk. Who would invest so much time and emotional energy in writing a book if it could be ripped off without consequence? This is the underlying argument of at least nine recent copyright-infringement lawsuits against companies that train generative-AI systems on, at the very least, tens of thousands of copyrighted books. AI companies could face liability in the hundreds of millions of dollars or more; one of the lawsuits alleges “systematic theft on a mass scale.”
Companies such as OpenAI and Meta have responded that their language models, like people, “learn” from books and create “transformative” original work. The training, they assert, is therefore lawful, and no copies are being made. In a court filing responding to one of the cases last fall, Meta stated that using texts to train its generative-AI model, LLaMA, to statistically model language and generate original expression is transformative by nature and quintessentially fair use.
Yet AI companies use other people’s work “without consent, credit, or compensation” to build billion-dollar products, as the artist Karla Ortiz told a Senate subcommittee last year. For many authors and artists, the stakes are existential: the machines threaten to replace them with cheap, automated production, supplying text and images on demand.
The writers in these lawsuits argue that copyright should stop AI companies from going further down this path. The cases raise the central question of generative AI’s place in society: Does what AI offers ultimately outweigh its costs? Copyright law has supported a vibrant creative culture since 1790. Will it endure?
Contrary to popular belief, copyright does not exist primarily for the benefit of creators. According to founding documents and more recent interpretations alike, its aim is to cultivate a culture that produces outstanding works of science, art, literature, and music. By giving creators significant control over how those works are copied and distributed, it provides a financial incentive to see the work through.
Because of this concern for the general welfare, current law also permits certain “fair uses” of copyrighted works: quoting brief passages from a book, say, or using a story’s plot and characters in a parody, or displaying photo thumbnails in search results. (Remember Spaceballs?) AI companies argue that training large language models (LLMs) on copyrighted books is likewise fair use, because the training creates a new type of product and does not reproduce the books’ full text.
These claims are now being tested. Is it in the public interest to allow copyrighted material to be used for AI training? Judge Stephanos Bibas weighed this question a few months ago in Thomson Reuters v. Ross Intelligence, a case involving the use of court records to train an AI research tool. Each side, Bibas observed, has its own view of what is best for the public. Tech corporations contend that their AI products will expand access to knowledge, while plaintiffs contend that, because AI output is usually presented as the AI’s own production rather than as the original work, the products will weaken the incentive to share knowledge in the first place. The idea that current AI-training practices could chill human creativity is one the courts will have to take seriously; some writers have already stopped publishing their work online.
Whether generative-AI products benefit society as a whole is the central question copyright law must answer. As the legal scholar Matthew Sag testified to the Senate last year, any product that would “substantially undermine[] copyright incentives” may not qualify as fair use. If people routinely ask ChatGPT about books and articles rather than reading them, the audience for those works, and the motivation for their authors, will shrink. AI tools that present human-created knowledge without citing its sources are already making it harder for readers and writers to find like-minded people, endangering the health of research communities, and eroding the incentives to develop and share expertise. All of this could produce a society in which knowledge is discouraged rather than promoted, a future in which Salinger concludes that writing The Catcher in the Rye simply isn’t worth it. (OpenAI responded to such worries this week in a motion to dismiss The New York Times’ lawsuit against the company, insisting that ChatGPT is a tool for “efficiency” and “not in any way a substitute” for a subscription to the paper.)
Tech firms and AI advocates have argued that if a person doesn’t need a special license to read a book in a library, an AI shouldn’t either. But as the legal scholar James Grimmelmann has noted, the fact that an individual may do something for self-education doesn’t mean a company is justified in doing the same thing at massive scale for profit.
As for the claim that AI training “transforms” authors’ original works, the case typically cited as precedent is Authors Guild v. Google, in which the plaintiffs sued Google for scanning millions of books to build the research product known as Google Books. The judge in that case found the scanning to be fair use because Google Books primarily served as a research tool and strictly limited how much copyrighted text it displayed. The purpose of Google Books, which offers insights across the entire collection through tools such as its Ngram Viewer, was also very different from the purpose of the books used to build it, which are meant to be read.
Generative-AI products such as DALL-E and ChatGPT, however, don’t necessarily serve purposes distinct from the literature and artwork they were trained on. AI-generated text and images can substitute for buying a book or hiring an illustrator. And while an LLM’s output typically differs from its training text, that isn’t always the case.
The recent Times lawsuit, along with another brought by Universal Music Group, has demonstrated that LLMs sometimes reproduce their training text. According to UMG, Anthropic’s LLM Claude can reproduce entire song lyrics nearly verbatim and present them to users as original compositions. The Times showed that ChatGPT can reproduce long passages from Times articles. This phenomenon is known as “memorization,” and at the moment it is difficult, if not impossible, to eliminate. It can be masked to some extent, but the complexity and unpredictability of generative AI (often described as a “black box”) prevent its developers from making any guarantees about how often, and under what conditions, an LLM will draw directly on its training data. Imagine a journalist or a student who refused to promise not to plagiarize: that is the ethically questionable position these products occupy.
Courts have sometimes struggled with new technologies. Consider the player piano, a machine that takes paper rolls as input. The rolls are sheet music with punched holes in place of written notes. A sheet-music publisher once sued a piano-roll maker, claiming the company was producing and distributing unauthorized copies. The case went all the way to the Supreme Court, which ruled in 1908 that the rolls were merely part of the player piano’s “machinery” and not copies at all. In hindsight, it was a strange decision, analogous to claiming that a DVD isn’t a copy of a movie because the sound and images are encoded digitally rather than in analog.
The ruling was rendered moot by the Copyright Act of 1909, which established that piano-roll manufacturers did indeed owe royalties. But as Grimmelmann told me, courts do not always see arcane technologies of reproduction clearly. Separated from the intellectual property that drives them, such technologies can seem magical or incomprehensible.
Some question whether copyright law, which has remained essentially unchanged since the late 1700s, can handle generative AI at all. Its fundamental unit is the “copy,” a concept that hasn’t felt very contemporary since music and video streaming arrived in the 1990s. Could generative AI finally bend copyright past its breaking point? I put this question to William Patry, a former senior official at the U.S. Copyright Office who was a senior copyright attorney at Google during the Authors Guild dispute, and whose copyright treatises are among those most frequently cited by federal courts. He told me he had written laws for a living for seven years. “It’s not simple,” he said. New technologies that challenge established legal frameworks and social norms emerge regularly, in his view, but the law cannot endlessly reshape itself to accommodate them.
Copyright’s language can seem painfully antiquated, though good laws probably need to be that way. According to Patry, the author of How to Fix Copyright, the law must be both sturdy, so that we know what to expect, and dynamic, “in the sense of having play in the joints.” He is critical of parts of copyright law, but he doesn’t believe AI will be the technology that finally breaks it.
Instead, he said, judges are likely to rule cautiously. A blanket ruling on AI training is improbable. Rather than declaring that “AI training is fair use,” judges may find that training some AI products is permissible and training others is not, depending on a product’s features or how often it quotes its training data. Different rules may eventually emerge for commercial and noncommercial AI systems. Judges may even weigh seemingly extraneous factors, Grimmelmann suggested, such as whether a defendant has developed its AI products responsibly or carelessly. In any case, judges face hard choices. As Bibas acknowledged, deciding whether the public interest is better served by protecting a creator or a copier is a hazardous and awkward position for a court.
If novels, investigative journalism, and well-researched nonfiction were lost, generative AI could not make up for them. Because it is a statistical prediction system that works only with data it has already encountered, it can produce only echoes of past work. Were it to become the preeminent form of creation, it would bring a culture to a standstill. If human authors aren’t motivated to create and publish works that move us, help us empathize, and transport us to imagined worlds that shift our perspective and let us see reality more clearly, we simply won’t have a culture. Generative AI may supply synthetic memories of the past, but can it help us plan for the future?