Is it permissible for generative AI models to be trained on data that is copyright protected?
The issue stems from the way generative AI systems are trained. Like most machine learning software, they work by identifying and recreating patterns in data. But because these programs are used to generate code, prose, music, and art, that data must itself be made by humans, and much of it is scraped from the web and protected by copyright.
This was not a major concern for AI researchers in the distant past (i.e., the 2010s). At the time, state-of-the-art models could only generate fuzzy, fingernail-sized black-and-white images of faces; they posed no imminent threat to anyone. But in 2022, when a lone amateur can use software like Stable Diffusion to copy an artist's style in a matter of hours, and when companies are selling AI-generated prints and social media filters that are blatant knockoffs of living designers, questions of legality and ethics have become far more pressing.
Consider Hollie Mengert, a Disney illustrator who discovered that her art had been copied as an AI experiment by a mechanical engineering student in Canada. The student downloaded 32 of Mengert's pieces and spent many hours training a machine learning model to reproduce her style. For her, she said, it feels like someone is taking work she has done and things she has learned — she has been a working artist since graduating from art school in 2011 — and using them to create art she didn't consent to and didn't give permission for, Mengert told technologist Andy Baio, who reported the case.
But is that justified? And what can Mengert do about it?
To answer these questions and get a sense of the legal landscape around generative AI, The Verge spoke to a range of experts, including lawyers, analysts, and employees at AI startups. Some said with confidence that these systems are certainly capable of infringing copyright and could face serious legal challenges in the near future. Others suggested, with equal confidence, the opposite: that everything currently happening in generative AI is legal and that any lawsuits are doomed to fail.
The truth, says Baio, who has been closely following the generative AI scene, is that nobody knows. He sees people on both sides of the debate who are extremely confident in their positions, but, he argues, anyone who claims to know with certainty how this will play out in court is wrong.
There are plenty of unanswered questions here, but according to Andres Guadamuz, an academic at the UK's University of Sussex who specializes in AI and intellectual property law, most of the topic's uncertainties flow from just a few key questions. First, can the output of a generative AI model be copyrighted, and if so, who owns it? Second, if you own the copyright to material used as input to train an AI, does that give you any legal claim over the model or the content it creates?
Once those questions are answered, an even bigger one emerges: how do you deal with the fallout of this technology? What kinds of legal restrictions could, or should, be placed on data collection? And can the people building these systems and the people whose data is needed to build them find a way to coexist?
Let’s address each of these concerns one at a time.
Can you copyright the output of an AI model?
For the first question, at least, the answer is not too difficult. In the US, there is no copyright protection for works generated solely by a machine. But it seems copyright may be possible in cases where the creator can prove there was substantial human input.
In September, the US Copyright Office granted a first-of-its-kind registration for a comic book created with the help of the text-to-image AI Midjourney. The comic is a complete work: an 18-page narrative with characters, dialogue, and a traditional comic book layout. And although it has since been reported that the USCO is reviewing its decision, the comic's copyright registration has not yet been revoked.
The degree of human involvement in the comic's creation appears to be one factor in that review. Kristina Kashtanova, the artist behind the work, told IPWatchdog that the USCO had asked her to provide details of her process to show there was substantial human involvement in creating the graphic novel. (The USCO itself declines to comment on specific cases.)
Guadamuz says granting copyright for works created with the help of AI will remain a thorny question. In the US, he doesn't think simply typing "cat by van Gogh" is enough to earn copyright protection. But if you start experimenting with prompts, generating many images, fixing seeds, and fine-tuning your results — engineering the output, in other words — he can absolutely see that being covered by copyright.
The level of human involvement will probably determine how much of an AI model’s output is protected by copyright.
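To make that distinction concrete, here is a minimal sketch of the kind of iterative, seed-controlled workflow Guadamuz describes, using the open-source diffusers library. The model checkpoint and prompts are illustrative assumptions, not details from any of the cases above.

```python
# A minimal sketch of an iterative, seed-controlled generation workflow,
# using the open-source Hugging Face `diffusers` library. The checkpoint
# name and prompts are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # a public Stable Diffusion checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# Iterating over prompts and fixing seeds is what turns one-shot generation
# into a reproducible, hand-tuned process.
prompts = [
    "a cat, oil painting, thick swirling brushstrokes",
    "a cat, oil painting, thick swirling brushstrokes, starry night sky",
]
for i, prompt in enumerate(prompts):
    for seed in (7, 42, 1234):
        generator = torch.Generator(device="cuda").manual_seed(seed)
        image = pipe(prompt, generator=generator, num_inference_steps=30).images[0]
        image.save(f"candidate_p{i}_s{seed}.png")
```

The point is not the specific tool but the accumulation of deliberate choices: prompts refined across runs, seeds fixed so results can be reproduced, compared, and curated.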
Given this framework, it is likely that the vast majority of the output of generative AI models cannot be copyrighted: it is typically churned out in bulk from a prompt of just a few words. But more involved processes would make for stronger cases. These might include contested works like the AI-generated print that won first place in a state art competition; its creator said he spent weeks honing his prompts and manually editing the finished piece, suggesting a significant degree of intellectual input.
Giorgio Franceschelli, a computer scientist who has written about the problems of AI copyright, says weighing human input will be "particularly true" for cases decided in the EU. And the law is different again in the UK, a jurisdiction of particular interest to Western AI businesses. The UK is one of only a handful of countries to offer copyright protection for works generated solely by a computer, though it defines the author as "the person by whom the arrangements necessary for the creation of the work are undertaken." Again, there is room for multiple interpretations (is that "person" the model's developer or its operator?), but it offers a basis for granting some form of copyright protection.
Registering a copyright, though, is only the first step, cautions Guadamuz. The US Copyright Office is not a court, he notes: you need to register your copyright before suing someone for infringement, but it is a court that will decide whether that copyright is legally enforceable.
Can you train AI models using data that is copyright protected?
In the opinion of most experts, the thorniest questions about AI and copyright concern the data used to train these models. Most systems are trained on huge volumes of content — text, code, or imagery — scraped from the web. The training dataset for Stable Diffusion, one of the biggest and most influential text-to-image systems, for example, contains billions of images collected from hundreds of domains: everything from personal blogs hosted on WordPress and Blogspot to art communities like DeviantArt and stock photo sites like Shutterstock and Getty Images. There's a good chance your own work is already in one of the enormous training datasets behind generative AI — and there's even a website where you can check by uploading a picture or running a text search.
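Lookups like that are typically backed by a similarity index built over the same public LAION data used for training. The sketch below uses the open-source clip-retrieval client; the backend URL and index name are assumptions based on the project's public demo service, so treat it as illustrative rather than a guaranteed endpoint.

```python
# A hedged sketch of querying a public CLIP index over LAION-style training
# data, using the open-source `clip-retrieval` client. The backend URL and
# index name are assumptions based on the project's public demo service.
from clip_retrieval.clip_client import ClipClient

client = ClipClient(
    url="https://knn.laion.ai/knn-service",  # assumed public knn backend
    indice_name="laion5B-L-14",              # assumed index name
    num_images=10,
)

# A text query returns nearest-neighbor entries with their source URLs,
# which is how a creator can check whether images like theirs were scraped.
for hit in client.query(text="illustration in the style of a children's book"):
    print(hit.get("url"), "-", hit.get("caption"))
```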
AI researchers, startups, and multibillion-dollar tech companies alike justify this scraping (in the US, at least) by appealing to the doctrine of fair use, which is meant to encourage the use of copyrighted work in the service of free expression.
Daniel Gervais, a professor at Vanderbilt Law School who specializes in intellectual property law and has written extensively on its overlap with AI, notes that many factors go into deciding whether something is fair use. But two elements, he says, are "much, much more prominent": What is the purpose or nature of the use, and what effect does it have on the market? In other words: does the use threaten the livelihood of the original creator by competing with their work, and does it change the nature of the content in some way — a quality usually described as "transformative" use?
Although it’s probably permissible to train generative AI on copyright-protected material, you might be able to utilize the same model for illicit purposes.
Given the weight placed on these factors, Gervais says, "it is considerably more likely than not" that training systems on copyrighted material is covered by fair use. But the same is not necessarily true of generating content with them. In other words: you can train an AI model on other people's data, but what you then do with that model might be infringing. Think of it as the difference between printing fake money for a film prop and trying to buy a car with it.
Consider the same text-to-image model deployed in different scenarios. If the model is trained on many millions of images and used to generate novel pictures, it is extremely unlikely that this infringes anyone's copyright: the training data has been transformed in the process, and the output poses no threat to the market for the original art. But if you fine-tune that model on 100 pictures by a particular artist and generate images in that artist's style, an unhappy artist would have a much stronger case against you.
Give an AI 10 Stephen King novels and ask it to produce a Stephen King novel, and you are competing directly with Stephen King. Is that fair use? Probably not, says Gervais.
Between these two poles of fair and unfair use, though, lie countless scenarios in which input, purpose, and output are all weighted differently and could tilt any judicial ruling one way or the other.
Most companies selling these services are aware of these distinctions, according to Ryan Khurana, chief of staff at the generative AI company Wombo. "Intentionally using prompts that draw on copyrighted works to generate an output […] violates the terms of service of every major participant," he told The Verge over email. But "enforcement is tough," he adds, and companies are more interested in "finding ways to prohibit utilizing models in ways that violate copyright […] than limiting training data." This is especially true of free and open-source text-to-image models like Stable Diffusion, which can be trained and used with no restrictions or filters at all. The company behind the model may be keeping its own hands clean, but it can also be enabling uses that infringe copyright.
Baio has nicknamed this approach "AI data laundering." He notes the technique has been used before in building facial recognition software, citing the case of MegaFace, a dataset compiled by University of Washington researchers who scraped photos from Flickr. The university researchers took the data and cleaned it up, says Baio, and commercial firms then made use of it. That data, including millions of personal photos, is now in the hands of the facial recognition company Clearview AI, law enforcement agencies, and the Chinese government, he says. A tried-and-tested laundering process like this would surely help shield the developers of generative AI models from liability, too.
There is one more wrinkle here, though: Gervais points out that the current definition of fair use may be about to change thanks to a pending Supreme Court case involving Andy Warhol and Prince. In that case, Warhol used photographs of Prince to create artwork. Was that fair use, or copyright infringement?
The Supreme Court doesn't take on fair use cases very often, so when it does, it usually makes a major ruling, and Gervais expects it will do the same here. And it's risky, he says, to call anything settled law while a Supreme Court decision is pending.
How can AI firms and artists coexist peacefully?
Even if the training of generative AI models is ultimately found to be fair use, that will hardly resolve the field's problems. It won't necessarily apply to other domains of generative AI, such as code and music, and it won't placate artists angry that their work has been used to train commercial models. With that in mind, the question becomes: what solutions, technical or otherwise, could let generative AI flourish while crediting or compensating the creators whose work made the field possible?
The most obvious suggestion is to license the data and pay its creators. For some, though, this would kill the industry. Bryan Casey and Mark Lemley, authors of "Fair Learning," argue that training datasets are so large that "there is no plausible option simply to license all of the underlying photographs, videos, audio files, or texts for the new use." "Fair Learning" has become a cornerstone of the case for applying fair use to generative AI.
Others, though, note that we've navigated copyright problems of comparable scale and complexity before and can do so again. Several experts compared this moment to the era of music piracy, when file-sharing programs were built on the back of massive copyright infringement and thrived only until legal challenges forced new arrangements that respected copyright.
In the early 2000s there was Napster, which everyone loved but which was entirely illegal, noted Matthew Butterick, a lawyer currently suing companies for scraping data to train AI models, in an interview earlier this month; today, we have services like Spotify and iTunes. How did those systems emerge? Through companies striking licensing deals and bringing in content legitimately. All the stakeholders came to the table and made it work, and he finds the idea that something comparable can't happen for AI a little dispiriting.
Researchers and businesses are already experimenting with different approaches to pay creators.
Wombo's Ryan Khurana predicted a similar outcome. Music has by far the most complex copyright rules, he said, because of the many licensing models, the variety of rights holders, and the number of intermediaries involved. Given the complexities of the legal issues surrounding AI, he believes the licensing structure of the entire generative field will eventually come to resemble that of music.
Other options are also being trialed. Shutterstock, for instance, plans to set up a fund to pay artists whose work is used by AI companies to train their models, and DeviantArt has created a metadata tag for images shared online that warns AI researchers not to scrape their content. (At least one small social network, Cohost, has already rolled out the tag across its site and says it "won't rule out legal action" if it finds researchers scraping its images anyway.) But artists' reactions to these approaches have been mixed. Can a one-time license fee ever make up for a lost livelihood? And how does a no-scraping tag deployed now help creators whose work has already been used to train commercial AI systems?
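For the technically curious, the DeviantArt tag is essentially a robots directive, so checking whether a page opts out takes only a few lines of code. This sketch assumes the "noai" directive appears either in an X-Robots-Tag response header or in a robots meta tag, as publicly described; the URL is only an example.

```python
# A rough sketch of checking a page for the "noai" opt-out directive,
# assuming it appears either in an X-Robots-Tag response header or in a
# robots <meta> tag, as publicly described by DeviantArt.
import urllib.request

def opts_out_of_ai(url: str) -> bool:
    req = urllib.request.Request(url, headers={"User-Agent": "noai-check/0.1"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        header = (resp.headers.get("X-Robots-Tag") or "").lower()
        body = resp.read(512_000).decode("utf-8", errors="replace").lower()
    # Match "noai" in the header, or a robots meta tag mentioning it in the body.
    return "noai" in header or ('name="robots"' in body and "noai" in body)

if __name__ == "__main__":
    # Example target only; any site's actual behavior may vary.
    print(opts_out_of_ai("https://cohost.org"))
```

Of course, a tag like this is purely advisory: it only works if scrapers choose to check for it, which is exactly why its backers reserve the threat of legal action.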
For many creators, then, the damage has already been done. But AI companies are at least proposing new approaches for the future. One obvious step forward is for AI researchers to build datasets that carry no risk of copyright infringement, either because the material has been properly licensed or because it was created for AI training in the first place. One such example is "The Stack," a dataset for AI training designed explicitly to avoid accusations of copyright infringement. It includes only code with the most permissive open-source licenses and offers developers an easy way to have their data removed on request. Its creators say the industry could adopt their model.
The Stack was put together in collaboration with ServiceNow, and according to Yacine Jernite, Machine Learning & Society lead at Hugging Face, "The Stack's concept can easily be extended to different media." It's an important first step, he says, in exploring the wide range of consent mechanisms that exist — mechanisms that work best when they take into account the rules of the platform the training data comes from. Jernite says Hugging Face wants to help bring about a "fundamental shift" in how AI researchers treat creators. For now, though, the company's approach remains the exception rather than the rule.
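As a sketch of what consuming such a dataset looks like in practice: The Stack is published on the Hugging Face Hub and can be streamed with the datasets library. The repo ID, subset path, and field names below follow the public dataset card but should be treated as assumptions, and access requires accepting the dataset's terms on the Hub.

```python
# A sketch of streaming a single-language slice of "The Stack" with the
# Hugging Face `datasets` library. The repo id, subset path, and field
# names follow the public dataset card but should be treated as assumptions.
from datasets import load_dataset

stack = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",  # one language subset; the full corpus is multi-terabyte
    split="train",
    streaming=True,          # stream records instead of downloading everything
)

for example in stack.take(3):
    print(example["max_stars_repo_name"], "-", len(example["content"]), "chars of code")
```

The license filtering and opt-out handling happen upstream, when the dataset is assembled, which is the whole point: consumers of the data inherit a corpus that was curated with consent in mind.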
What happens next?
Whatever we make of these legal questions, the players in the generative AI world are already preparing for… something. The tech companies are planting their flags, insisting that everything they do is legal (while presumably hoping nobody actually tests that claim). On the other side of the line, copyright holders are tentatively staking out positions of their own without quite committing to action.
Getty Images recently banned AI-generated content, citing the potential legal risk to its customers; CEO Craig Peters said he thinks the technology could be illegal. And the RIAA, the music industry's trade body, has declared that AI-powered music mixers and extractors infringe its members' copyright.
But the fight over AI and copyright has already begun: last week brought the announcement of a proposed class action lawsuit against Microsoft, GitHub, and OpenAI. All three companies are accused of knowingly reproducing open-source code through Copilot, an AI coding assistant, without the required licenses. In an interview last week, the lawyers behind the suit said it could set a precedent for the entire generative AI field (though other experts disputed this, noting that any copyright challenges involving code would likely be distinct from those involving content like art and music).
Meanwhile, Guadamuz and Baio both say they're surprised there haven't been more legal challenges already. Honestly, Guadamuz admits, he is flabbergasted. He suspects it's partly because these industries are afraid of being the first to sue and losing the resulting ruling. But once someone takes the plunge, he thinks, lawsuits will start flying in every direction.
Part of the problem, says Baio, is that many of the people most affected by this technology — artists and the like — are simply not in a good position to launch legal challenges. They don't have the resources, he says: litigation of this kind is extremely expensive and time-consuming, and you only pursue it if you're confident of winning. This is why he long assumed the first lawsuits over AI art would come from stock image sites. They seem to have the most to lose from the technology, they can clearly show that huge portions of their catalogs were used to train these models, and they have the money to take it to court.
Guadamuz agrees. Everyone knows how expensive it will be, he says: whoever sues first will get a ruling in the lower courts, then appeal, then appeal again, and the case could ultimately go all the way to the Supreme Court.