With the release of ChatGPT in late 2022, generative AI entered the cultural zeitgeist. Not surprisingly, within a few months, the first generative AI lawsuits were filed in the U.S. (e.g., Andersen v. Stability AI, Getty v. Stability AI, Doe v. GitHub). Many more suits have since followed.
While generative AI raises a host of intellectual property issues, perhaps the most pressing question is whether using an author’s creative work to train a machine learning model without permission qualifies as fair use. In Bartz v. Anthropic (“Anthropic”) and Kadrey v. Meta (“Meta”), two courts in the Northern District of California (within the Ninth Circuit) provide some guidance. These cases offer a glimpse into the ongoing worldwide war between the copyright infringement claims of copyright holders and the fair use defenses of generative AI developers, and they demand the attention of companies and their counsel.
The relevant facts in Anthropic and Meta are similar. At the heart of both cases is the defendants’ use of centralized libraries containing unauthorized copies of books to train large language models (LLMs). In Meta, all of the books at issue were pirated copies sourced from shadow libraries, while Anthropic involved both pirated books and digital scans of lawfully purchased physical books that were destroyed after digitization. Both cases hinged on whether these actions qualified as “fair use” under U.S. copyright law.
In Meta, the court found that downloading and using illicit copies of books to train LLMs was highly transformative and, absent evidence of market harm, constituted fair use. Despite the illicit means by which the defendant obtained the books, the court seemed particularly persuaded by the highly transformative nature of using books to train LLMs that can generate massive amounts of new content with minimal effort, as well as by Meta’s efforts to limit how much of any book its models could output to a user. In particular, Meta had shown that its LLM never regurgitated more than 50 words from any particular book to a user. Despite finding fair use, the court expressed frustration that the plaintiffs had not presented more evidence that the output of LLMs is likely to dilute the market for their works, and suggested that future plaintiffs would be wise to develop the record on market dilution more fully.
In Anthropic, the court found that digitizing purchased physical books and storing the resulting digital copies in a central library, as well as making multiple unauthorized copies of books in the course of training an LLM, were fair use. Unlike in Meta, the Anthropic court found little merit in arguments about market dilution, stating that generating competing works with LLMs “is not the kind of competitive or creative displacement that concerns the Copyright Act,” and that the purpose of the Act is “not to protect authors against competition.”
With regard to storing pirated copies of books in a central library for training LLMs, however, the Anthropic court seemed to reach the opposite conclusion from the Meta court, going so far as to describe that behavior as “inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.”
While the war between copyright holders and LLM developers is still in its infancy, these cases provide several important takeaways for companies and their counsel.
For copyright holders, the decisions are mostly bad news. While the court in Meta suggested a novel legal theory for showing market harm, it is unclear how market dilution would play out in practice and whether it aligns with established fair use precedent. As a result, the market for licensing copyrighted works specifically for training machine learning models may be severely limited, if it exists at all. As a consolation prize, LLM developers may buy one copy of a copyright holder’s book before digitizing it and using it to train their LLM, but even that revenue may be jeopardized if a majority of courts adopt the holding in Meta.
In contrast, for LLM developers, the news is mostly good. First, there appear to be multiple paths to building the massive libraries needed to train their LLMs. Companies with a high risk tolerance have at least one court agreeing that downloading and using pirated copies of books to train an LLM is fair use, provided plaintiffs fail to show market harm. But buyer beware: there is already a split between these two courts on the issue, not to mention the potential reputational hit with copyright holders and the general public. The less risky alternative appears to be building libraries by digitizing physical copies of purchased books, then destroying the physical copies.
It may seem counterintuitive, but LLM developers may consider returning to the negotiating table with copyright holders. Given the outcome of these cases, copyright holders may be willing to accept terms more favorable to LLM developers. This approach may save LLM developers the time and energy associated with purchasing and digitizing individual books, and may facilitate a more productive relationship with copyright holders going forward.
The cases also provide LLM developers additional guidance for reducing legal risk. Specifically, LLM developers should place guardrails on the output of their LLMs to eliminate or minimize the amount of any copyrighted work reproduced for a user.
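As a purely illustrative sketch of what such a guardrail might look like (it is not drawn from either opinion or from any particular vendor’s system; the 50-word threshold, function names, and reference corpus are all hypothetical assumptions), a developer could screen model output for long verbatim runs against a set of reference works before returning it to the user:

```python
# Hypothetical output guardrail: block responses that reproduce a long
# verbatim span of any reference work. The threshold and names below are
# illustrative assumptions, not taken from the Meta or Anthropic opinions.

MAX_VERBATIM_WORDS = 50  # illustrative policy threshold


def longest_shared_run(output_text: str, reference_text: str) -> int:
    """Length, in words, of the longest verbatim word sequence that appears
    in both the model output and the reference work."""
    out, ref = output_text.split(), reference_text.split()
    best = 0
    prev = [0] * (len(ref) + 1)
    for i in range(1, len(out) + 1):
        curr = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            if out[i - 1] == ref[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def passes_guardrail(output_text: str, reference_works: list[str]) -> bool:
    """Return True only if the output reproduces no long verbatim span of
    any work in the reference corpus."""
    return all(
        longest_shared_run(output_text, work) < MAX_VERBATIM_WORDS
        for work in reference_works
    )
```

A production system would need something far more scalable (for example, n-gram fingerprinting over a large corpus), but the underlying idea is the same: cap how much of any single work can be reproduced verbatim in a response.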
For companies using LLMs (i.e., everyone), consider the following when negotiating contracts with LLM developers:
- Audit Training Data. Require a list of all training datasets used to train or fine-tune the developer’s LLM. Verify that the training datasets do not include pirated works from shadow libraries.
- Ensure Guardrails. Confirm that the developer’s LLM has adequate guardrails to eliminate or minimize the output of copyrighted works. Assign one or more creative employees to stress test those guardrails.
- Request Indemnification. Consider seeking indemnity for copyright infringement attributable to the developer’s LLM. While these cases and most generative AI cases thus far have focused on LLM developers, there is still some risk for LLM users. Strong indemnity provisions can provide LLM users protection.
Meta and Anthropic are the first of what is expected to be a long line of court decisions related to generative AI and copyright. These cases focused on LLMs trained on books; different data modalities may result in different outcomes. For example, it will be important to see how the analysis changes when the training data and output are image-based, as in Getty v. Stability AI, or when the output more closely resembles the training data, as in Doe v. GitHub, where both the training data and the model output are computer code.
Given the highly fluid and global nature of this area of law, it is important that companies and their counsel track key developments around the world.