Generative AI and Copyright Law: A Primer
Everything you need to know about how generative AI is challenging copyright law
Generative AI models are trained on millions of articles, images, songs and videos, often used by AI companies without the consent or acknowledgement of the original creators.
Creators have argued this constitutes ‘systematic theft on a mass scale’ and have brought multiple lawsuits across several jurisdictions against these companies for infringement of copyright.
The question is how successfully copyright law can be adapted to protect artists from this novel threat. AI companies have argued it is in the public benefit to allow them to train their models on copyrighted material, while artists have claimed that generative AI threatens the very ability of society to foster art and creativity.
Generative AI poses a significant challenge to traditional copyright law and the outcomes of legal actions may have major ramifications for AI companies, artists and creative industries.
Below is a high-level summary of the major issues followed by answers to most questions you might have and links to important cases, statements and other resources. This primer should not be taken as legal advice, but do let me know if you think of any cases or resources I should add or notice any inaccuracies and I will update. Ok, let’s delve into it ;)
Summary of where we are at (last updated 15 April 2024):
- We don’t have a definitive answer to most questions. Many cases have not yet concluded so we only have a few summary judgments and decisions on minor points. Overall, cases have leaned towards AI companies, but much still hangs in the balance.
- We can differentiate between: 1. The training of models on copyrighted materials which could breach the law (i.e. this may not be ‘fair use’ of the material); and 2. AI outputs that reproduce part or all of copyrighted material which may breach the law (it’s an unlawful reproduction of the work).
- AI companies argue their use of copyrighted works is ‘fair use’ because it is a ‘transformative’ use of the original works. There are signs that this argument will win out over creators’ claims (see Tremblay v OpenAI). But this question has not yet been directly answered by the courts.
- In analogous cases over the past decades, non-consensual copying has repeatedly been found to be fair use where it was done to achieve some social benefit (eg. Google’s digitisation of books in Authors Guild v Google 2015).
- But courts have not yet decided whether training machine learning models is ‘learning’ rather than ‘copying’ copyrighted works, or whether this distinction will affect the cases. My feeling is that it will be difficult for creators to make the case this is not fair use, but it will depend on the facts.
- Only humans can hold copyright (except in the UK, see below). Courts have been unwilling to extend authorship rights to purportedly autonomous AI systems (see Thaler v Perlmutter 2023).
- For individuals to own copyright over gen AI outputs they will have to do more than simply type in prompts. The US Copyright Office has stated that human prompters do not exercise sufficient creative control over gen AI tools to be considered authors of AI outputs (even after 600+ prompts).
- Most of the cases and analysis are focussed on the US, but jurisdictions could swing in different directions. This could lead to AI companies engaging in ‘jurisdiction shopping’ as to where they train their models. This hasn’t become an issue yet, but was a key point in the summary judgment of the UK case, Getty v Stability AI.
- For the many artists, authors and creatives organising against generative AI, legal avenues are by no means a silver bullet and the answer may well be political pressure against companies developing these tools and the many other organisations that may wish to use them. Boycotts, union agreements, petitions and other forms of collective organising are effective routes for opposing the theft of artists’ works and the spread of generative AI media across society.
What are AI companies saying?
Most gen AI companies argue that the use of text, audio, images and video to train their models (even those under copyright) is ‘fair use’ and the courts should allow their use for training so the public can benefit from the development of these tools.
For example, OpenAI’s Response to the House of Lords Communications and Digital Select Committee inquiry: Large language models (UK) 5 December 2023:
‘Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.’
Another indicative response is from Meta in Kadrey v Meta, the parallel suit brought by Sarah Silverman and other authors:
‘Use of texts to train LLaMA to statistically model language and generate original expression is transformative by nature and quintessential fair use’.
OpenAI and others have admitted to using copyrighted works but believe this is fair use and the only way such technology can be developed.
David Holz, founder and CEO of Midjourney, has also admitted to training Midjourney on copyrighted works without permission from copyright holders:
‘No. There isn’t really a way to get a hundred million images and know where they’re coming from. It would be cool if images had metadata embedded in them about the copyright owner or something. But that’s not a thing; there’s not a registry. There’s no way to find a picture on the Internet, and then automatically trace it to an owner and then have any way of doing anything to authenticate it.’
A more detailed explanation can be found in OpenAI’s Comment Regarding Request for Comments on Intellectual Property Protection for Artificial Intelligence Innovation (US):
‘we draw on our experience in developing cutting-edge technical AI systems, including by the use of large, publicly available datasets that include copyrighted works.’
‘For the above reasons, we believe that courts would and should rule that training AI systems on copyrighted works constitutes fair use. However, given the lack of case law on point, OpenAI and other AI developers like us face substantial legal uncertainty and compliance costs. Resolving this issue by holding training AI systems to be fair use would eliminate the uncertainty in this area and remove substantial barriers to the development of innovative AI systems.’
‘We submit that:
I. Under current law, training AI systems constitutes fair use.
II. Policy considerations underlying fair use doctrine support the finding that training AI systems constitute fair use.
III. Nevertheless, legal uncertainty on the copyright implications of training AI systems imposes substantial costs on AI developers and so should be authoritatively resolved.’
One of the largest troves of evidence comes from the Copyright Office Notice of Inquiry on Copyright and Artificial Intelligence (30 August 2023). All the big gen AI companies submitted to this inquiry, including OpenAI, Microsoft, Google, Stability AI and Hugging Face, and all of them, unsurprisingly, argued that training gen AI models on copyrighted material should be considered fair use. The Verge collected a bunch of their top lines here.
Stability AI offers an indicative statement:
‘We believe that training AI models is an acceptable, transformative and socially beneficial use of existing content that is protected by the fair use doctrine and furthers the objectives of copyright law, including to “promote the progress of science and useful arts”’.
Stability AI’s VP of Audio, Ed Newton-Rex, resigned after this submission, stating:
‘I've resigned from my role leading the Audio team at Stability AI, because I don't agree with the company's opinion that training generative AI models on copyrighted works is “fair use”.’
Newton-Rex went on to found a non-profit, Fairly Trained, which plans to certify AI models that ask permission to use copyrighted material.
What is ‘fair use’?
In the US, a number of gen AI cases will turn on how courts interpret the ‘fair use’ doctrine (17 U.S.C. § 107), which permits the non-consensual use of copyrighted works ‘for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research’. Courts have also found fair use where there is a ‘transformative use’ of the copyrighted material, in a manner different from that for which it was originally intended.
Copyright infringement requires not just copying of a work’s material form but also the unauthorised use of the work for its expressive purpose. Merely technical or non-communicative uses are not uses of a work for its expressive purpose and therefore are not copyright infringement. So it’s not just copying that will get you in trouble, you have to take this extra step.
The fair use doctrine has been the major argument of AI companies using copyrighted works without the permission of their creators. Whether or not this constitutes fair use depends on four statutory factors:
1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
2. the nature of the copyrighted work;
3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
4. the effect of the use upon the potential market for or value of the copyrighted work.
These four factors have been assigned different levels of importance in various cases. In considering them, the courts will assess whether the social benefit of the use of the copyrighted work justifies the potential harm it may cause the copyright owner. This is why many of the AI companies’ responses to lawsuits emphasise the novelty and importance of LLMs.
1. Regarding the first factor, the US Supreme Court held in Campbell v Acuff-Rose Music (1994) that when the purpose of the use is ‘transformative’ rather than mere reproduction, it will likely favour fair use. AI companies have argued that using copyrighted works to train models is not mere reproduction but precisely such a ‘transformative’ use. On this point, showing a one-off verbatim quote or very similar image may not be sufficient to establish that the tools are not transformative.
2. The nature of the copyrighted work affects how it is weighed in the fair use test. Works that are highly original products of human creativity (songs, novels, paintings etc.) face a higher threshold for fair use, whereas ‘factual’ works are more likely to be subject to it.
3. How much of the work was used? Most AI companies use the whole of a copyrighted work in training, so one would imagine this factor would go against them. But the outputs do not make all of these works immediately available to the public, so there is some scope for companies to claim the amount made available to the public is relatively small. AI companies have argued that copyrighted works are not used in a manner that makes them freely available, but only to train their gen AI models. This is an odd interpretation of the third factor, which says nothing about making a work available to the public. But it is one reason why plaintiffs have attempted to show the models reproduce entire or substantial portions of copyrighted works.
4. The effect the reproduction has on the value of the original. This is open to interpretation. Many artists have claimed that generative AI will devalue all human creative work, and there are signs this is already taking place. AI companies will claim that their tools offer unique products that supplement rather than replace existing services, but this seems difficult to uphold given the likely economic effects of AI on creative industries.
Many fair use cases relating to non-consensual copying have been brought before US courts, and most were found to be fair use when the copying advanced scientific knowledge, improved digital systems or provided greater access to information. For example, in Authors Guild v Google, the court held that the digitisation of large volumes of copyrighted books was fair use because it revealed new information about them.
Are models ‘copying’ or ‘learning’?
For the ‘copying’ camp, there is some compelling evidence:
Gary Marcus and Reid Southen have argued that models such as Midjourney and DALL-E can reproduce remarkably similar versions of images in their training data.
Carlini and co-authors from Google Research and the University of Pennsylvania have published findings that ‘Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized training data verbatim.’
The New York Times lawsuit against OpenAI includes numerous examples of OpenAI software reproducing stories from the newspaper nearly word for word. In its response, however, OpenAI claims that this ‘regurgitation’ is a ‘rare bug’ and that the Times must have ‘intentionally manipulated prompts, often including lengthy excerpts of articles, in order to get our model to regurgitate’.
AI companies have insisted that machine learning techniques should be recognised by the courts as something distinct from copying works and that the training process does not involve models storing a copy of the works in a database.
In Andersen v Stability AI (2023), Judge Orrick expressed scepticism concerning plaintiffs’ claims that the model Stable Diffusion was itself a ‘derivative work’ because it allegedly stored ‘compressed copies’ of the copyrighted images it was trained on. The judge asked plaintiffs to clarify their theory, define ‘compressed copies’ and to provide more facts for their argument. As this case progresses, it may shed more light on this issue.
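To get a feel for what the memorization evidence looks like in practice, here is a minimal sketch of the kind of probe researchers run: feed a model a prefix it likely saw during training and check whether it continues with the source text verbatim. It uses GPT-2 via the Hugging Face transformers library as a stand-in model; the Dickens prefix and the exact-match check are my illustrative assumptions, not the method of any particular paper or lawsuit.

```python
# A minimal memorization probe (illustrative sketch, not any paper's method):
# give the model a prefix it likely saw in training and check whether its
# greedy continuation matches the known source text verbatim.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prefix = "It was the best of times, it was the worst of times,"  # Dickens (public domain)
expected = " it was the age of wisdom, it was the age of foolishness"  # known continuation

inputs = tokenizer(prefix, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=15,
    do_sample=False,  # greedy decoding: the model's single most likely continuation
    pad_token_id=tokenizer.eos_token_id,
)
continuation = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:])
print("model continuation:", continuation)
print("verbatim match:", continuation.startswith(expected))
```

If a model completes long, distinctive passages verbatim like this at scale, that is the kind of behaviour plaintiffs point to as ‘copying’; AI companies counter that such outputs are rare edge cases.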
The EU’s AI Act and copyright
The EU’s AI Act was not originally intended to touch upon copyright law, but advances in LLMs led drafters to include some provisions. Under Article 53, developers of general-purpose AI models will have to disclose any copyrighted material used to develop their systems and have policies in place that respect European Union copyright law, in particular honouring any opt-outs that copyright owners have placed on their works. AI companies can take advantage of an exception to copyright in the EU’s DSM Directive for what’s known as ‘Text and Data Mining’ beyond scientific research purposes, which includes AI training. This means developers should be able to use copyrighted works for training AI so long as they can provide a ‘sufficiently detailed summary of the content used for training’ and they respect opt-outs.
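How an opt-out is expressed in practice is still unsettled, but one common machine-readable mechanism is a robots.txt rule blocking AI crawlers (OpenAI’s GPTBot, for instance, is documented to honour robots.txt). Below is a minimal sketch of checking such an opt-out using Python’s standard library; it is illustrative only, not a complete test of compliance with the EU’s Text and Data Mining opt-outs.

```python
# Minimal sketch: check whether a site's robots.txt opts out of an AI crawler.
# Illustrative only; a robots.txt rule is one common opt-out mechanism, not a
# complete test of EU 'Text and Data Mining' opt-outs.
from urllib import robotparser

def may_crawl(site: str, user_agent: str = "GPTBot") -> bool:
    """Return True if robots.txt permits user_agent to fetch the site root."""
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{site}/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt
    return rp.can_fetch(user_agent, f"{site}/")

# A site opting out of GPTBot would serve:
#   User-agent: GPTBot
#   Disallow: /
print(may_crawl("https://example.com"))
```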
Can non-human agents enjoy copyright protection?
Probably not. ‘Authors’ must be human beings. In a 2023 judgment by a US District Court in Thaler v Perlmutter, the court held that ‘human authorship is an essential part of a valid copyright claim’. Stephen Thaler had wanted to register a visual artwork he claimed was authored autonomously by an AI program he called ‘the Creativity Machine’. The court upheld the longstanding position that copyright law only affords protection to works by human beings. In the case, Judge Howell argued that ‘human creativity is the sine qua non at the core of copyrightability, even as that human creativity is channeled through new tools or into new media.’ In previous cases, US courts have denied that a monkey snapping a selfie could own copyright over the picture or that a work could have been composed by ‘the Holy Spirit’. It’s humans all the way down.
Are AI-generated works copyrightable by human prompters?
Not in most cases, particularly in the United States. In March 2023, the US Copyright Office released guidance that when AI ‘determines the expressive elements of its output, the generated material is not the product of human authorship.’ In general, the Copyright Office has held that AI models could not be seen as ‘a tool… controlled and guided to reach [their] desired image… because it generates images in an unpredictable way’ (see the USCO’s decision in ‘Zarya of the Dawn’). In one case, even 600+ prompts were not enough to demonstrate sufficient control over the specific expressive or aesthetic elements of the output for the individual to claim copyright. An individual would have to show a level of control over the creative output that goes beyond mere prompting. However, the Copyright Office is not a court and it remains to be seen whether federal courts will uphold its reasoning.
In the United Kingdom, works generated solely by a computer can be copyrighted. The ‘author’ of a ‘computer generated work’ is defined (quite ambiguously!) as ‘the person by whom the arrangements necessary for the creation of the work are undertaken’. It remains unclear as to whether this would be seen as the developer or the user of the generative AI system.
In China, in November 2023 the Beijing Internet Court held that a work generated with Stable Diffusion could have copyright which belonged not to AI developers but to the prompter. Unlike in the US, the court ruled that the prompter contributed to the production of the work with intellectual input and aesthetic choices that created an original work of art. This marks a striking contrast between the United States and China’s approach to gen AI and copyright.
What comes next?
Licensing Deals – We will likely see an increase in the number of licensing agreements struck between gen AI companies and publishers of copyrighted works, such as the deals between OpenAI and Axel Springer, The Associated Press and Shutterstock. Such deals offer AI companies the possibility of avoiding expensive litigation by paying for access to quality training data.
Updates at the US Copyright Office – Following an initiative to examine the impact of generative AI on copyright law and policy, the US Copyright Office has begun publishing sections of a new report on potential recommendations about any legislative or regulatory action. In addition, it will publish an update to the Compendium of U.S. Copyright Office Practices, the administrative manual for registration. You can follow updates at the Office here.
More litigation – The cases currently before the courts could take a number of years and it will likely be some time before we have clear answers on how copyright law will develop in light of generative AI. In the meantime, expect a lot more litigation and ambiguity about where different parties stand on these issues.
New legislation – A range of regulatory approaches have been suggested, among them, a new bill introduced by Adam Schiff, the Generative AI Copyright Disclosure Act, which would require the disclosure of any copyrighted works used in the training of gen AI systems before their release.
Resistance from creators – Some artists have called for publishers to ban generative AI, while others have called on fans to boycott companies that use AI generated art. There has been a huge amount of pushback on the introduction of AI art into creative industries and this is likely to continue for the foreseeable future.
Summary of main cases:
The Knowing Machines research project also has a useful legal explainer with some info on the main cases (although limited to the US context and last updated Nov 2023).
United States
Tremblay v OpenAI (combined with Silverman v OpenAI; Chabon v OpenAI)
Order on Motion to Dismiss: 12 February 2024 (Northern District of California)
Main takeaway: In this case, the court didn’t think that the authors had sufficiently demonstrated direct infringement, which is a positive sign for the AI companies as it highlights the difficulty of proving gen AI outputs as direct copies of copyrighted works incorporated in their training data.
What’s it about? In a class action copyright case, Paul Tremblay, Sarah Silverman, Christopher Golden, Richard Kadrey and others claimed OpenAI (and Meta) illegally acquired their copyrighted works through shadow libraries and used them to train their models. The authors presented evidence that the chatbots could summarise their books, demonstrating that the authors’ works had been used in training. They asserted six causes of action, including negligence and unjust enrichment.
What’s happened? On 12 February 2024, the District Court judge partially dismissed the claims of Silverman, Tremblay and others on the grounds there was not a ‘substantial similarity’ between their books and ChatGPT’s output. The court was not convinced of the claim that ‘every output of the OpenAI Language Models is an infringing derivative work’. However, OpenAI still faces the claim that it violated unfair competition law by using copyrighted works without permission. The judge gave the plaintiffs leave to amend their case. The case is ongoing.
New York Times v OpenAI
What’s it about? The New York Times alleges OpenAI’s ChatGPT was trained on millions of its articles and can reproduce them verbatim. The Times claims OpenAI gives ‘particular emphasis’ to its articles and ‘seek[s] to free-ride on the Times’s massive investment in its journalism by using it to build substitutive products without permission or payment’. In contrast to many other cases that focus on the use of copyrighted works in the training data, this case also focuses on the outputs, with the Times presenting examples of reproduced articles. No judgment has yet been made.
What’s happened? On 27 December 2023, the New York Times filed a complaint in the Southern District of New York against Microsoft and OpenAI. OpenAI published a statement on 8 January 2024 stating, among other things, that ‘training is fair use’, it ‘provides an opt-out’ and ‘“Regurgitation” is a rare bug that we are working to drive to zero’. On 27 February 2024 it then filed to have parts of the lawsuit dismissed, arguing that the newspaper ‘hacked’ its chatbot through ‘deceptive prompts that blatantly violate OpenAI's terms of use’. It claims the Times was only able to generate its evidence ‘by targeting and exploiting a bug’ and that it took ‘tens of thousands of attempts to generate the highly anomalous results’ presented in the case. In a separate motion to dismiss filed on 4 March 2024, Microsoft likened the Times’s criticisms of this new technology to Hollywood executives’ initial backlash against the VCR. See Andres Guadamuz’s interesting discussion of Microsoft’s use of the ‘Sony Doctrine’ in relation to this point.
Andersen v Stability AI
Order on Motion to Dismiss: 30 October 2023 (Northern District of California)
Main takeaway: The court dismissed most of the plaintiffs’ case, noting it was ‘defective in numerous respects’. But, it left most questions open, allowing the plaintiffs to ‘clarify their theory’ about how copies of training images were used in the models for each respective defendant.
What’s it about? Sarah Andersen, Kelly McKernan and Karla Ortiz brought a case against Stability AI, DeviantArt and Midjourney alleging the companies’ models had been trained on copyrighted works of the artists. The artists claimed the models produced outputs in the particular artistic styles of the artists and this constituted direct copyright infringement, vicarious copyright infringement and violated the Digital Millennium Copyright Act. Andersen relied on evidence of an online tool, https://haveibeentrained.com/ to assert that her works are present in the training datasets of the gen AI models.
What’s happened? On 30 October 2023, District Judge William Orrick dismissed all but one of the artists’ claims and gave the plaintiffs leave to amend and clarify their case. The judge found that the evidence provided by the online tool was a ‘sufficient basis to allow her copyright claims to proceed at this juncture’. It didn’t help the artists that neither Ortiz nor McKernan had any valid copyright registrations, so the case was narrowed to those works that Andersen had registered with the Copyright Office. The judge dismissed claims that AI outputs in general were ‘derivative works’ because of a lack of ‘substantial similarity’ to copyrighted content.
UK
Getty v Stability AI
Summary Judgment: 1 December 2023 (English High Court)
Main takeaway: The training of a gen AI model would not constitute an infringement of English law if the actual training of the model didn’t take place in the UK.
What’s it about? Getty Images claims Stability AI scraped millions of its images and used them unlawfully to train Stable Diffusion and that the model’s outputs also infringe its copyright under the UK’s Copyright Designs and Patents Act 1988.
What’s happened? The judge ordered the case to go to trial, declining to strike out Getty’s claims at an interim hearing. Much of the case turned on whether any part of Stability’s model was trained in the UK: Stability AI claimed none of it was, so the case would fail. Stability claimed no UK staff worked on the training and that computing power was accessed from the US; at issue were the location of development teams and any resources used in training the software. The judge found insufficient evidence to grant summary judgment on this point, meaning that while it looked like the model wasn’t trained in the UK, the case should go to trial to determine this. The judgment implied that if the model was not trained in the UK the case would be dismissed, raising the prospect that companies could simply train their models in favourable jurisdictions. A further question for trial is whether AI software is ‘an article’ under UK copyright law; if it is not, only the training stage rather than the actual reproduction of images would be at issue.
More Resources:
- Everything written by Andres Guadamuz, Reader in Intellectual Property Law at the University of Sussex, at Technollama
- Harry Jiang et al., ‘AI Art and its Impact on Artists’ – a paper outlining the multiple harms creators suffer due to generative AI, with a number of suggestions for mitigating them, including forcing AI companies to disclose training data and exploring tools for artists to prevent their work from being used without their consent.
- Shawn Shan et al., ‘Glaze: Protecting Artists from Style Mimicry by Text-to-Image Models’ – shares the details of a tool that enables artists to apply ‘style cloaks’ to their art before sharing online.
- Peter Henderson et al. ‘Foundation Models and Fair Use’ – this paper analyses issues with foundation models and their ability to adequately adhere to the fair use doctrine, including technical mitigations to help models stay in line with US law.
- Katherine Lee et al., ‘Talkin’ ‘Bout AI Generation: Copyright and the Generative-AI Supply Chain’ – introduces the idea of a ‘generative AI supply chain’ to break down the different moments where AI companies will have to consider relevant copyright law.
- This discussion between The Verge’s Nilay Patel and Sarah Jeong on generative AI and copyright. An interesting discussion pondering whether copyright cases could be ‘an extinction level event’ for generative AI, although I disagree with their analysis and their view that the law (and fair use in particular) is somehow ‘vibes based’ rather than the interpretation and application of established principles to novel facts.
- Paris Marx interviews Karla Ortiz on ‘How artists are fighting generative AI’.
- Another Paris Marx interview, this time with Molly Crabapple on ‘Why AI is a Threat to Artists’.