So let’s get right down to it: If you use AI, you are a thief. If you don’t use AI and you have published content online that can be accessed without going through some sort of paywall, you have probably been robbed. Does this sound extreme? Well, I’m truly sorry to say that it isn’t.
To those of you in the know, these statements likely won’t shock you. The rest of you are probably just shaking your heads in disbelief. So let’s get into the nitty-gritty of this. The first step in understanding how I can make such a broad statement and legitimately claim that it’s accurate is to understand how copyright works:
Copyright is a type of intellectual property that protects original works of authorship as soon as an author fixes the work in a tangible form of expression. In copyright law, there are a lot of different types of works, including paintings, photographs, illustrations, musical compositions, sound recordings, computer programs, books, poems, blog posts, movies, architectural works, plays, and so much more!
…
Everyone is a copyright owner. Once you create an original work and fix it, like taking a photograph, writing a poem or blog, or recording a new song, you are the author and the owner.
Okay, so copyright is a benefit afforded to anybody who creates anything that can even remotely be considered original. You don’t have to do anything to receive the benefit of copyright protections for your work. It is a default state. The other keyword in the explanation above is “protect”. So what does that mean, precisely? Let’s delve into that by reading further:
U.S. copyright law provides copyright owners with the following exclusive rights:
- Reproduce the work in copies or phonorecords.
- Prepare derivative works based upon the work.
- Distribute copies or phonorecords of the work to the public by sale or other transfer of ownership or by rental, lease, or lending.
- Perform the work publicly if it is a literary, musical, dramatic, or choreographic work; a pantomime; or a motion picture or other audiovisual work.
- Display the work publicly if it is a literary, musical, dramatic, or choreographic work; a pantomime; or a pictorial, graphic, or sculptural work. This right also applies to the individual images of a motion picture or other audiovisual work.
- Perform the work publicly by means of a digital audio transmission if the work is a sound recording.
The keyword in the explanation above is “exclusive”. This is the default state, which means these protections apply to anything you create. This system has allowed millions of artists, authors, and musicians to make a living off of their creative works, and it arguably works pretty well in its original form. Note: I’m specifically avoiding talking about the length of copyright protections, as this is a much-abused portion of the system that companies like Disney have hijacked in an effort to prolong their existence amid an ongoing multi-decade drought of creative original expression on their part.
Okay, so what does this have to do with AI? Well, it has everything to do with AI, as it turns out. That’s because AI needs to be trained, and the only way to do that is to expose these Large Language Models (LLMs) to a wide variety of content. Without that content, an LLM is a pointless abstract coding concept capable of producing nothing. The generally accepted answer to the question “How much content do you need to train an AI?” is, of course, “the more the merrier.”
AI companies and their products are very data hungry. This is why companies like Reddit, StackOverflow, and others have been waging legal battles and creating new rules and restrictions around who can harvest their content and for what purpose. But here is the thing: That content doesn’t belong to those companies. It’s your content. It’s my content. It doesn’t belong to the companies. But they host it, and like any other skeevy middleman, they want a cut of everything that happens there.
So how can they get away with this? Buried within the licensing agreements of all of these platforms (aka that big long textbox full of legalese shit nobody reads when they first sign up for any online service) is some clause in which you, as an end user, agree to grant the hosting service some sort of license to your original creative works. This is because, in order to conform to copyright law, they need some sort of license that you agree to before they can legally reproduce your work for other users to experience. That license may or may not be broad enough to include training LLMs. If it wasn’t, they have probably updated the license agreement in the last few years so that they are granted that right.
Okay, so what’s a copyright license? Well, we have effectively already answered the question indirectly by describing how these companies secure the rights to do things with the content you produce for their platforms, but let’s be more explicit about it:
A copyright license gives a person or entity (“licensee”) the authorization to use a work from the copyright owner, usually in exchange for payment. Copyright licenses may be exclusive or nonexclusive, and the rights that come with them vary according to the specifics of each license.
So the important takeaway here is that a copyright license grants some or all consumers of a work additional rights. Because, by default, outside of fair use (which is complex and outside the scope of this discussion), end users don’t have a lot of rights in terms of what they can do with other people’s copyrighted works. This concept is why software companies are able to legitimately claim that they don’t sell software, but rather sell licenses to software. While shitty from an end user perspective, this is a legal concept that is at the very core of copyright law.
TLDR: If you don’t explicitly or implicitly license your work, other entities, whether they be people or corporations, aren’t allowed to use it to create derivative works.
Okay, so what’s the problem and how does it relate to ITF? Well, as it so happens, my employer has recently begun deploying an AI coding assistant that runs “on premise” and is built upon data obtained from “The Stack V2” archive. I’m not going to reveal the name of the vendor or the product, but suffice it to say that the vendor guarantees that this version of their product is only trained using permissively licensed GitHub repositories (which basically means code repos whose licenses do not require derivative works to also be open sourced under the same or a compatible license; that requirement is known as copyleft).
This is of importance to my employer, as we produce proprietary software in the form of cloud services that we sell access to. Management picked this product because its claims were superior to those of most other products in this regard, and they could run it in their own private VM somehow and not have to worry about a data-hungry centralized AI entity training itself on our internal, unlicensed, proprietary code. They also liked the fact that it could be additionally trained on our own code repositories.
I was included in one of the early testing groups, but alas, as I was already opposed to these things and AI in general, I did not participate. However, a few weeks into this, while really stoned, I became infected with a specific thought: Is this vendor lying about how they train their models? Surely they must be… right? But I figured that at some point they must be providing enough specific data to validate their claim so that potential customers wouldn’t just assume they were lying. So I headed over to their website to find out.
What I found was shocking at the time, but from a more sober perspective (aka not stoned), it isn’t really a surprise. Long story short: The vendor provides dozens of CSV files for download in the public documentation that makes the claim that their “private” models were only trained on permissively licensed repositories. I downloaded all of them and imported all 60-million-plus rows into a SQLite database. Then I ran some queries, and it turns out that they are lying. A lot. Over 58 million of those “permissively licensed” repos aren’t licensed at all, and this is specified directly within their own CSV files. That accounts for over 96% of the training content by repository count.
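For the curious, the whole analysis fits in a few lines of Python. Here is a minimal sketch of the kind of thing I did; the paths and column names (`repo_name`, `license`) are stand-ins, not the vendor’s actual schema:

```python
import csv
import glob
import sqlite3

conn = sqlite3.connect("repos.db")
conn.execute("CREATE TABLE IF NOT EXISTS repos (name TEXT, license TEXT)")

# Hypothetical layout: the vendor's real CSVs use different names/columns.
for path in glob.glob("vendor_csvs/*.csv"):
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        conn.executemany(
            "INSERT INTO repos (name, license) VALUES (?, ?)",
            ((row["repo_name"], row["license"]) for row in reader),
        )
conn.commit()

# How many of these "permissively licensed" repos carry no license at all?
total = conn.execute("SELECT COUNT(*) FROM repos").fetchone()[0]
unlicensed = conn.execute(
    "SELECT COUNT(*) FROM repos WHERE license IS NULL OR TRIM(license) = ''"
).fetchone()[0]
print(f"{unlicensed:,} of {total:,} repos unlicensed ({unlicensed / total:.1%})")
```

Nothing fancy. The damning part is that the empty license field is sitting right there in the vendor’s own files.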
I performed some spot checks against some of them, and my manual process verified what their CSVs said: the unlicensed repos are either not obviously licensed or outright not licensed at all. I struggled with what to do next, as the other engineers in the beta were tripping all over themselves to find ways to get this trinket to do more of their work for them, judging by the Slack channel conversation. I didn’t want to be the guy who takes a shit in the punch bowl. After struggling with this over a weekend, I ultimately decided to share my findings, because I felt I had a responsibility as an employee to make them aware of the risks they are incurring by continuing to use this product.
Oh that’s right. We haven’t covered that, have we? What exactly are the risks?
The risk here is simple. Because the model we are using was trained on other people’s unlicensed code, the code it produces could arguably be considered a derivative work of that code. If one of the owners of these unlicensed repos ever discovered this useful little tool and realized that their code had been used to train countless AI coding assistant models, they would be well within their rights to sue the vendors pushing this garbage, and then maybe even the vendors’ customers.
For a proprietary cloud services company, that is a massive risk. Every single line of code this thing generates that we commit into our product increases our overall liability and the chances that one day we’ll find ourselves on the hook for violating the copyright of other software engineers.
Okay, but I’m getting ahead of myself, right? Surely given the strength of my results and the specific data I was able to provide, the vendor and our executive team immediately took the required action to fix the problem, right?
Right?
Hell the fuck no. I made the case and I got down and dirty with it. Eventually the vendor decided to fall back on one of my favorite logical fallacies as their defense: an appeal to authority. That authority, in this case, is The Stack V2 itself and this particular bit of text on their main page (linked above):
License detection
We extract repository-level license information from GH Archive for all repositories with matching names in the SWH dataset. When the repo-level license is not available, i.e., for 96.93% of repositories, we use the ScanCode Toolkit to detect file-level licenses as follows:
- Find all filenames that could contain a license (e.g., LICENSE, MIT.txt, Apache2.0) or contain a reference to the license (e.g., README.md, GUIDELINES);
- Apply ScanCode’s license detection to the matching files and gather the SPDX IDs of the detected licenses;
- Propagate the detected licenses to all files that have the same base path within the repository as the license file.
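In other words, it’s a filename heuristic. Here’s roughly what the described process boils down to, sketched in Python as I read it. This is my interpretation, not their actual pipeline; the hint list and the `detect` callback are placeholders:

```python
import os

# Filename patterns that might carry or reference a license. These hints
# are my guesses at the kind of list the quoted description implies.
LICENSE_HINTS = ("license", "copying", "mit", "apache", "readme", "guidelines")

def looks_like_license_carrier(filename: str) -> bool:
    name = filename.lower()
    return any(hint in name for hint in LICENSE_HINTS)

def propagate_licenses(repo_root, detect):
    """Label every file with the license detected in a carrier file that
    shares its base path. `detect` stands in for ScanCode's file-level
    detection and returns an SPDX ID or None."""
    labels = {}
    for dirpath, _dirs, filenames in os.walk(repo_root):
        # Steps 1 and 2: find carrier files here and scan them for licenses.
        spdx_ids = [
            lic
            for f in filenames
            if looks_like_license_carrier(f)
            and (lic := detect(os.path.join(dirpath, f)))
        ]
        if not spdx_ids:
            continue
        # Step 3: propagate the detection to everything at the same base path.
        for f in filenames:
            labels[os.path.join(dirpath, f)] = spdx_ids[0]
    return labels
```

The failure mode is baked right in: any file that merely contains license-looking text in a license-looking filename gets to bless every file sitting next to it, whether or not it has anything to do with how the repo’s own code is licensed.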
So in response to that, I actually downloaded the referenced ScanCode tool and ran it against my unlicensed repository examples. The results were not promising. The first one got some license matches, but as it turns out, it was a PHP project that uses the Composer package manager, which happens to store the licenses of its dependencies within a composer.lock file located at the root of the repository. So all of those “matches” are complete bullshit: they are the licenses of third-party dependencies, not a license for the repo’s own code. The second repository had no matches at all.
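If you want to reproduce this, ScanCode can emit JSON (I ran it with something like `scancode --license --json-pp results.json <repo>`), and it’s trivial to see exactly which files the matches come from. A rough sketch follows; note that the JSON field names have shifted between ScanCode releases, so treat them as approximate:

```python
import json
from collections import Counter

with open("results.json") as f:
    results = json.load(f)

matches_per_file = Counter()
for entry in results.get("files", []):
    # Older ScanCode releases report a per-file "licenses" list; newer ones
    # use "license_detections". Check both rather than assuming a version.
    detections = entry.get("license_detections") or entry.get("licenses") or []
    if detections:
        matches_per_file[entry["path"]] += len(detections)

LOCKFILES = ("composer.lock", "package-lock.json", "yarn.lock", "Cargo.lock")
for path, count in matches_per_file.most_common():
    # Dependency lockfiles carry the licenses of other people's packages,
    # not this repo's license.
    note = "  <-- dependency metadata" if path.endswith(LOCKFILES) else ""
    print(f"{count:4d}  {path}{note}")
```

Strip out the lockfile noise and both of my example repos come up exactly as the vendor’s own CSV files said: unlicensed.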
So the people behind The Stack V2 dataset have played very fast and loose with licensing because they are desperate for more training data. The vendors who make use of this dataset are also turning a blind eye to the issue, because without this lion’s share of unlicensed code to train against, their AI coding assistants likely would not be able to compete against the others out there.
In any event, the executives at my company haven’t done anything with this… yet. The vendor moved to close my ticket after blowing off my detailed response, which went through the ScanCode output for my example repositories and proved that they are in fact unlicensed. The executives at my employer have been very tight-lipped since then. Meanwhile, the Slack channel for the AI assistant rollout is abuzz with activity: questions, tips, and tricks from foolish engineers who ought to know better but don’t.
So what’s my next move? Nothing. I won’t use this tool and no force in the company will change my position on that. However, I can’t change the fact that ongoing use of this tool by others is polluting our codebase and weakening my employer’s claim of exclusive copyright on what used to be “their own” code. I guess that will be some future person’s problem though because nobody currently employed there seems to give a shit.
What a fucking shame.