Friday, April 17, 2026

Panel 5: Copyrightable Subject Matter and the Special Problem of Software

29th Annual BTLJ-BCLT Spring Symposium: Origins, Evolution, and Possible Futures of the 1976 Copyright Act

Pamela Samuelson, UC Berkeley Law (Moderator and Speaker): discusses history (in which she was intimately involved as an intellectual powerhouse). From uncertainty over whether software was protectable to Whelan, which gave very broad protection; it took 6 years for the Second Circuit to respond and start with Baker v. Selden to keep functional elements out of © protection. Merger, scenes a faire, 102(b), fair use—doctrinal cocktails, in the words of Molly Van Houweling.

Samuelson initially thought sui generis protection for software would be better, but admits error: © did a really good job and gave an international standard that’s enabled some stability.

Jule Sigall, former Microsoft: CONTU was doing its work as Microsoft was just getting started. Trade secrets, patents, and copyrights do different work in different eras of software. 1980: PC era—rapid rise of copyright’s relevance. Business model: product licenses. Practical control: EULA, shrinkwrap, key disc/dongle. Copyright’s salience for executives was high: it was how they were going to recover fixed-cost investment. This was the model CONTU had in mind when it decided to embrace software ©: you make a product & send it out through distribution channels not unlike books.

1990s: WWW. Easier to send software as bits. Business model if people won’t necessarily pay for copies: hardware bundling (Apple; PC with independent OEMs); ad supported. Practical control: B2B contracts. Copyright salience: medium.

2000s: cloud and OSS: business model: subscription/SaaS/consulting. Practical control: server access control/OSS license; not much a pirate copy will do for you. Copyright salience: medium. Antipiracy efforts shifted to antifraud—scammers would purport to sell subscriptions. Open source was a different path—add consulting services to OSS or build services using OSS. That does depend on © but the most prevalent ©-based model was making software as accessible as possible and using © to ensure it was only used/redistributed in certain ways.

2010s: mobile era/app ecosystem. Business model: app store sales/subscriptions—you can, as in the 80s, get paid for a copy. Practical control: platform control/cloud services. © salience: low.

2020s: AI. Business model: ?? Practical control: ?? Copyright salience: None? [Real underpants gnomes vibes.] More software will be developed by more people than ever before. The tools allow people of all kinds to make software, and they allow software to make software. Maybe we are back where we started before CONTU with unclear © coverage.

Clark Asay, BYU Law: reasons for concern, but countervailing forces/reasons for optimism. FOSS licenses presuppose copyrightable code: copyleft, attribution, etc. W/o © the governance architecture becomes much less reliable. In the context of other developments that threaten open source—MongoDB and Elasticsearch have abandoned OSS licensing; monetization has always been a question for companies that can’t directly monetize software. AI agents: those agents are creating tons of software and making pull requests/contributions to OSS projects w/o human review; maintainers are being overwhelmed in some cases. Some projects are closing off in response. Open collaboration norms may be eroding from multiple directions simultaneously.

Might push us more in direction of trade secrecy and possibly patents. A more closed, fragmented software ecosystem and possibly AI system. But developers desire to influence the AI stack, which is likely to keep the ecosystem at least partially open.

A. Feder Cooper, Yale University (co-author Mark Lemley, Stanford Law School): Model weights that give a possibility but not a certainty of generating infringing output: is that a “copy”? Relates to memorization debate. It’s common to describe models as learning statistical correlations or patterns: that’s not wrong but it oversimplifies how info is represented. Another important part: how the LLM is used. Some methods of selecting outputs are deterministic—same input, same output; many are stochastic. Variability in outputs doesn’t derive from the model but from how the model is used in decoding.
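[The deterministic-vs-stochastic decoding point can be sketched in a few lines. This is a toy illustration with made-up numbers, not any real model's distribution: the model fixes one next-token distribution, and whether the output varies depends entirely on the decoding method applied to it.]

```python
import random

# Hypothetical next-token distribution for one fixed model state
# (numbers invented for illustration only).
probs = {"the": 0.6, "a": 0.3, "an": 0.1}

def greedy(probs):
    # Deterministic decoding: the same input state always yields the same token.
    return max(probs, key=probs.get)

def sample(probs, rng):
    # Stochastic decoding: the variability comes from this random draw,
    # not from the model's (fixed) distribution.
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)
print(greedy(probs))                           # always "the"
print([sample(probs, rng) for _ in range(5)])  # varies with the random draws
```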

Memorization is when, based on training, the model produces a really high concentration of probability on particular sequences. The model is still probabilistic, but the distribution is so sharply peaked that one sequence (or a small number of sequences) dominates. This is related to compression: memorization means that Ted Chiang’s “blurry jpg of the web” is sometimes not blurry at all for certain chunks. Memorization is pretty mysterious still—keeps giving new insights about LLM behavior. Not a bug; it’s far too interesting and complicated.
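[One way to make "sharply peaked" concrete is entropy: a memorized continuation looks like a next-token distribution with near-zero entropy. The distributions below are hypothetical, for illustration only.]

```python
import math

def entropy_bits(p):
    # Shannon entropy of a next-token distribution, in bits.
    return -sum(q * math.log2(q) for q in p if q > 0)

# Hypothetical distributions, invented for illustration.
diffuse = [0.25, 0.25, 0.25, 0.25]      # ordinary text: many plausible continuations
peaked  = [0.997, 0.001, 0.001, 0.001]  # memorized text: one continuation dominates

print(entropy_bits(diffuse))  # 2.0 bits of uncertainty
print(entropy_bits(peaked))   # ~0.03 bits: still probabilistic, but nearly certain
```

The peaked case is why "the model is probabilistic" and "the model reproduces the work" can both be true at once.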

What is a copy? The statute’s answer is pretty incoherent: copies are material objects in which a work is fixed. (The “by or under the authority of the © owner” can’t be taken seriously for infringement by copying. We used the same definitions for protectability and infringement, so courts just ignore that part for infringement.) In litigation, parties take extreme positions—no memorization, or models are just a collage. Neither of these is right, and sometimes not even partially right.

We can extract a near reproduction of Harry Potter from a short prompt from Meta’s Llama: the extraction from that prompt is deterministic. That’s an extreme result—extraction is possible from some models for some works and not others. Most of our experiments measure whether verbatim memorization is occurring; we can get more if we accept small changes like extra spaces or commas in place of semicolons. Sometimes we needed adversarial strategies but sometimes not. None of that work changes the model weights, but you can also change the weights to extract more works.

Jane Ginsburg et al. have shown that fine-tuning on public domain works can reveal memorization from previously-trained-on © works.

So is a model a copy fixed in a tangible medium of expression? That’s still complicated! You can make a copy by storing parts in ones & zeros. But you can’t say that Microsoft Word encodes War & Peace. Models aren’t like either of those things. Some of the memorization isn’t deterministic—you might only get a memorized copy one in 1000 times. Are the other 999 “stored” in the model? That would involve more copies stored than there are atoms in the universe.

Closest examples in existing law: Kelly v. Chicago Park District—garden isn’t fixed b/c it isn’t deterministic; video games where content is generated from a number of fixed options. Micro Star: the new levels aren’t really “in the game.” Nor would we say that all the possibilities currently exist. So maybe the answer is predictability: if the model weights can easily generate the work, functionally there’s a copy in the model. If it’s merely possible to extract the work through effort, it’s not a copy. Why it matters: if there’s a copy in the model, then copying the model is making a copy of the work. Maybe that’s fair use (via intermediate use) but we’d have to figure it out.

Doesn’t love the conclusion, but this is where the empirical evidence leads.

Samuelson for Sigall: you didn’t say much about patents—Whelan might be affected by the idea that patents weren’t available; then patents started becoming available, making thick © less attractive.

Sigall: late 90s was a marriage of two historical trends: if you want to go the IP route for software, patents might be more efficient/useful b/c there’s also a risk with seeking ©. Patents and © come with embedded strategic choices about your business. Book: Capitalism w/o Capital: many of the most successful companies today have intangible assets, not tangible assets—a lot of the benefit is taking advantage of synergies and spillovers in intangible assets. IP can interrupt and interfere w/those synergies & spillovers so it might not be optimal—businesses can capitalize on other aspects instead of IP.

Samuelson for Asay: what do you do w/the Office’s policy requiring you to ID the parts that are AI-generated and disclaim authorship? Will people do that or just pretend that they authored the whole thing?

Asay: Unworkable! Possible that developers will just continue as usual and ignore © complications, slapping license on even if code is AI-generated; that’s somebody else’s problem.

Sag: how do you deal with misuse of your work as evidence that LLMs don’t learn, they copy?

Cooper: Not great feeling! The research I do is careful and the papers are long; that’s not an accurate gloss of what models are doing. But it’s important to do the work to show information about model behavior that we didn’t know before.

Q: is Harry Potter an outlier given how many copies there are online?

A: It’s astonishing still to get a book from a fragmentary prompt; not all models do this and certainly not all the time, but other books can be derived; it’s hard to connect the dots from training data. Tried to do it with Coates’ “The Case for Reparations”—also got that from the same model—it’s very famous but not HP famous.

Cathy Gellis: isn’t © a background assumption for these business models even if you aren’t “relying” on it? If © didn’t exist, would these business models work?

Sigall: it’s a behavioral Q—what behavior is © shaping and it’s certainly possible that affects what businesses do with particular software. It’s there, but the Q is how do you use that fact as a business in your strategic choices? Microsoft housed its antipiracy department in the marketing department, not legal, because the goal wasn’t really to stop piracy but to get them to use Microsoft software. Other industries put antipiracy efforts in legal. Trying to understand actual behavior of users of their works and adapt to that. [This may also be relevant to the shift to streaming video/music!]

Brauneis: suggests that Office’s disclosure form isn’t onerous; doesn’t require you to ID which lines are AI-generated, so you should disclose and figure it out later.

Asay: may be true, but issue in the industry is norms/perceptions about copyrightability—that’s more important to behavior than technicalities of registration. [So what he’s saying is that coders have … always gone on vibes?]

Samuelson: A bit of an old problem with SaaS. Oracle started with a PD work and then made a derivative work from it; trying to sort which parts were protected from which weren’t was already a task.

Bracha: you said that you were wrong about sui generis protection for software because after that didn’t happen, courts rolled up their sleeves and did their job of developing relevant principles. Do you think that courts would do the same thing today?

Samuelson: good point—we sort of got sui generis protection w/in copyright.

Nimmer: works that incorporate works from the USG should in theory disclose that, even if it’s a paragraph quote; they don’t and it’s been a nonissue. So it could also work for AI.  
