Key Takeaways: While the legal landscape is continuing to take shape, a number of recent court decisions indicate that unlicensed use of copyrighted data to train AI models generally constitutes fair use. Each fair use inquiry depends on the facts at play, however, including, among other things, how the AI model’s output compares to the copyrighted data that was input into the model.
As generative AI continues to scale, one legal question sits at the center of industry debate: what restrictions does copyright law place on using data to train AI models? The answer has begun to emerge from the U.S. courts and is generally favorable to AI developers, though with important caveats. This post breaks down those court decisions and explains how companies can navigate them as they train their AI models.
Training
Effectively training AI models often requires massive amounts of copyrighted data. Obtaining licenses to use that data from every copyright holder can be expensive and time-consuming, particularly if there is no central data broker. But unlicensed use of such data can expose AI companies to copyright infringement liability, with potential damages running into the millions. Thus, at the beginning of the AI revolution, developers were caught between a rock and a hard place. In many cases, developers trained their AI models with unlicensed data, spawning a wave of copyright infringement lawsuits. Those lawsuits have produced a consistent theme: using data to train AI models generally constitutes “fair use.” Fair use is an affirmative defense to copyright infringement and turns on the four statutory factors detailed below.
The first factor turns on the purpose and character of the AI model’s output. If data is used to train an AI model for a “transformative” output, this factor favors fair use. Conversely, if the output serves the same purpose as the original data, this factor weighs against fair use. For example, in the landmark case Bartz v. Anthropic, 787 F. Supp. 3d 1007, 1022 (N.D. Cal. 2025), Anthropic’s use of copyrighted books to train Claude was found to be fair use, where the output was “spectacularly” different. “Everyone reads texts, too, then writes new texts[,]” the court reasoned, and that is not something a copyright holder can exclude others from doing.
On the other hand, in Thomson Reuters v. Ross, 765 F. Supp. 3d 382, 399 (D. Del. 2025), copying Westlaw headnotes as a shortcut to create a competing legal research tool was not fair use. In that case, the court considered the product a commercial substitute, not transformative, and distinguished Ross’s copying from situations in which copying is necessary to reach the underlying ideas.
The second factor turns on the nature of the copyrighted work at issue. Works valued for their originality receive more protection than functional or factual works. AI models can be trained on any type of work, so whether this factor favors fair use will vary from case to case. In Sega, for example, the court found that Sega’s video game programs contained unprotected aspects “that [could not] be examined without copying” and, thus, concluded that this factor favored fair use. Sega v. Accolade, 977 F.2d 1510, 1526 (9th Cir. 1992). Contrast that with Bartz, where Anthropic admitted that all of the copyrighted books “whether non-fiction or fiction” contained expressive elements. Bartz, 787 F. Supp. 3d at 1017, 1029. There, the second factor weighed against fair use.
The third factor turns on how much of the copyrighted work is shown in the AI model’s output. If only a small amount appears, this factor favors fair use. Whether the amount copied is substantial may depend on whether the output effectively serves as a competing substitute for the copyrighted work. A 2015 case involving Google’s publication of “snippets” of books sheds light on where to draw the line. Authors Guild Inc. v. Google, 804 F.3d 202, 221–22 (2d Cir. 2015). There, the Authors Guild showed that as much as 16% of a text could be accessed using Google’s snippet feature. Id. But the snippets were typically “scattered randomly throughout the book” and were subject to a number of other limitations that protected against their effective substitution for the books in question. Id. Thus, this factor favored fair use.
The fourth factor turns on the extent of market harm. In Kadrey v. Meta Platforms, the court identified three markets that a plaintiff could claim are harmed by copying works to train a generative AI model:
- The market for the works themselves, if the AI model’s output includes copies or substantially similar versions of the works;
- The market for the copyright holder to license its works for AI training purposes; and
- The market for the works themselves, if the AI model’s output includes new works that are neither copies nor substantially similar versions but are still similar enough to effectively compete as substitutes for the works.
Kadrey v. Meta Platforms, Inc., 788 F. Supp. 3d 1026, 1051 (N.D. Cal. 2025). The first market theory will typically favor fair use by AI developers whose models’ outputs are transformative. As the Kadrey court explained, the second market theory does not fit into the fair use analysis, because the market for licensing is not one that a copyright holder is legally entitled to monopolize. Id. at 1052. Though it found no market harm on the arguments and evidence before it, the Kadrey court expressed openness to measuring harm using the third market theory in other cases. Copyright holders may take that cue and develop arguments based on the third market theory in future cases, potentially swaying the fourth fair use factor against AI developers.
Conclusion
The initial wave of copyright lawsuits against AI developers indicates that the fair use factors, taken together, will generally favor fair use. Each case will, of course, depend on the particular facts at play. And, as litigation continues, the courts will continue to develop the law in this area and address open issues, including how to apply the third theory of market harm.