Programmer and lawyer Matthew Butterick has sued Microsoft, GitHub, and OpenAI, alleging that GitHub's Copilot violates the terms of open-source licenses and infringes the rights of programmers.
GitHub Copilot, released in June 2022, is an AI-based programming aid that uses OpenAI Codex to generate real-time source code and function recommendations in Visual Studio Code and other editors.
The tool was trained with machine learning using billions of lines of code from public repositories and can transform natural language into code snippets across dozens of programming languages.
Clipping authors out
While Copilot can speed up the process of writing code and ease software development, its use of public open-source code has led experts to worry that it violates the attribution requirements and other terms of the licenses covering that code.
Open-source licenses, like the GPL, Apache, and MIT licenses, require attribution of the author's name and defining particular copyrights.
However, Copilot strips this information: even when its suggested snippets are longer than 150 characters and taken verbatim from the training set, no attribution is given.
Some programmers have gone as far as to call this open-source laundering, and the legal implications of this approach were demonstrated after the launch of the AI tool.
"It appears Microsoft is profiting from others' work by disregarding the conditions of the underlying open-source licenses and other legal requirements," comments the Joseph Saveri Law Firm, which is representing Butterick in the litigation.
To make matters worse, people have reported cases of Copilot leaking secrets, such as API keys, that were mistakenly published in public repositories and thus swept into the training set.
Apart from the license violations, Butterick also alleges that the development feature violates the following:
- GitHub's terms of service and privacy policies,
- DMCA 1202, which forbids the removal of copyright-management information,
- the California Consumer Privacy Act,
- and other laws giving rise to the related legal claims.
The complaint was filed in the U.S. District Court for the Northern District of California and seeks statutory damages of $9,000,000,000.
"Each time Copilot provides an unlawful Output it violates Section 1202 three times (distributing the Licensed Materials without: (1) attribution, (2) copyright notice, and (3) License Terms)," reads the complaint.
"So, if each user receives just one Output that violates Section 1202 throughout their time using Copilot (up to fifteen months for the earliest adopters), then GitHub and OpenAI have violated the DMCA 3,600,000 times. At minimum statutory damages of $2500 per violation, that translates to $9,000,000,000."
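The complaint's arithmetic is easy to verify. A quick sketch, using the figures as stated in the filing (the user count and per-violation minimum are the complaint's claims, not independently confirmed):

```python
# Figures as claimed in the complaint, not independently verified.
users = 3_600_000               # Copilot users cited in the filing
violations_per_user = 1         # assume one infringing Output per user
min_statutory_damages = 2_500   # DMCA Section 1202 minimum per violation

total = users * violations_per_user * min_statutory_damages
print(f"${total:,}")  # prints $9,000,000,000
```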
Harming open-source
Butterick also touched on another subject in a blog post earlier in October, discussing the damage that Copilot could bring to open-source communities.
The programmer argued that the incentive for open-source contributions and collaboration is essentially removed by offering people code snippets and never telling them who created the code they are using.
"Microsoft is creating a new walled garden that will inhibit programmers from discovering traditional open-source communities," writes Butterick.
"Over time, this process will starve these communities. User attention and engagement will be shifted [...] away from the open-source projects themselves—away from their source repos, their issue trackers, their mailing lists, their discussion boards."
Butterick fears that given enough time, Copilot will cause open source communities to decline, and by extension, the quality of the code in the training data will diminish.
BleepingComputer has contacted both Microsoft and GitHub for a comment on the above, and we received the following statement from GitHub.
"We’ve been committed to innovating responsibly with Copilot from the start, and will continue to evolve the product to best serve developers across the globe." - GitHub.
Comments
h_b_s - 1 year ago
I remind everyone in the US, and inform people elsewhere, that the only arbiters of copyright infringement in the US are the US federal courts for works after 1976. Either this falls under the Fair Use exemptions (and no license is needed) or it doesn't, in which case the license grants and relevant laws matter and untangling them is the job of the judge/jury. Without a license there is no grant to use anything covered by copyright under US law aside from Fair Use. Just because there is no stated license doesn't mean a covered creation is free to use, quite the opposite. I also remind people that since Oracle v. Google all parts of source code fall under copyright protections, not just the body of a program, but header files that define APIs as well.
There is a well established 4 point test for Fair Use. I wouldn't hold my breath Copilot would meet all four points as required (use can't just meet one or two, it must meet all four).
https://fairuse.stanford.edu/overview/fair-use/four-factors/
"Open-source licenses, like the GPL, Apache, and MIT licenses, require attribution of the author's name and defining particular copyrights." is misleading. I'm pretty sure the GPL v2 and 2 clause BSD licenses do not require attribution of the authors, but they do require retention and notification of the terms of copyright licensing grants. While they don't require author attribution, it's common practice to leave author names in for the sake of courtesy, history, and for maintenance or questions.
It's also a requirement for the use of GPL software for any version of the license to provide access to any downstream changes, the license notice, and access to the original source tree to users of any resulting binaries. It's also problematic for Microsoft that to change licensing agreements in the US requires the consent of all contributors to a project, not just the person that uploaded it. This is a HUGE can of worms that's been opened here.
TecArtScien - 1 year ago
"Each time Copilot provides an unlawful Output it violates Section 1202 three times (distributing the Licensed Materials without: (1) attribution, (2) copyright notice, and (3) License Terms)," reads the complaint.
Why doesn't the open-source community create its own version of Copilot that doesn't violate the terms of use? I believe we would rather pay for an open-source version of Copilot that pays the developers their monetary dues.
Word22gamer - 1 year ago
Well, one problem is: who would buy it if it's already available as open source?
Second: let's say you make some complicated software but don't have the money to certify it. What better way to test it than to leave it to the open-source community to find bugs and back doors in your code? Some of the best cryptographic software packages are open source.
Mike_Walsh - 1 year ago
Heh. This whole business of MyCrudSoft buying-out Github was always a bare-faced grab to profit from the freely-supplied work of others. You can't blame the beast for reverting to type. You might just as well try to stop the sun from rising; it's inevitable (and about as much inertia, too).
jack37ck - 1 year ago
In my opinion, nothing is copied into GitHub Copilot. There is, afaik, no database and no lookup index in play, except maybe for checking the generated results.
My understanding is that language models assess the statistical prevalence of a symbol as a function of other symbols. They do not require a store of predefined text or snippets for their operation.
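To illustrate the point (a toy sketch, not how Codex actually works): even a minimal bigram model keeps only co-occurrence statistics after training, and the original text is discarded.

```python
from collections import Counter, defaultdict

# Toy bigram model: learns which token tends to follow which.
# After training, only counts remain -- no stored text or snippets.
counts = defaultdict(Counter)

def train(tokens):
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1

def most_likely_next(token):
    # Return the statistically most frequent successor, if any.
    return counts[token].most_common(1)[0][0] if counts[token] else None

train("the cat sat on the mat".split())
print(most_likely_next("the"))  # prints "cat" (ties broken by first occurrence)
```

A real code model works on the same statistical principle, just with a neural network over vastly more context instead of raw bigram counts.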
Statistical analysis of patterns should not be considered a derivation of those patterns, as that would have consequences for learning in general. Most of the higher knowledge you have acquired has been learned from copyrighted material.
And yes, it would be possible to exempt human learning from this because it is flawed (cf. "Bicentennial Man"). But what about scientific reasoning, based on other sources? Where does "human error" end?
There is a reason why running machine learning models is called inference. Inference is intended to replicate the human/natural "thought process." To disregard it as a copy has serious consequences for "human thought" as well.
Anyway, this is my opinion as I am no lawyer, but a hobby philosopher.
If a painter were to paint something indistinguishably in the style of van Gogh, he would be considered a master of his craft, but if an AI does the same, it is considered a fraud, a thief, and a copyright infringer. Quite a double standard, isn't it?