Generative AI faces existential crisis over copyright concerns

Generative AI has dominated the tech world over the last year. Criticism of the technology is plentiful, but it rarely poses any threat to its continued development. Copyright claims, however, pose an existential crisis for generative AI in its current form.

Right at the end of 2023, The New York Times (NYT) filed a lawsuit against OpenAI and its backer Microsoft. The allegation was that ChatGPT had been developed by infringing on the copyrights of millions of articles produced by NYT.

A key piece of evidence for this was that the chatbot could reproduce the content of such news articles when asked. NYT at one point considered a deal to retrospectively sell a license to OpenAI for its own archive, but negotiations evidently did not go as hoped. Now, the two sides are diametrically opposed. Microsoft, which makes prolific use of OpenAI technology within its own portfolio, is also court-bound over the NYT claim.

Fair use defense

Millions of NYT articles were allegedly part of the giant dataset used to train models such as GPT-3, GPT-3.5 and GPT-4. Although ChatGPT is free to use, OpenAI generated annual revenue of $1.3 billion by commercializing these AI models through ChatGPT Plus, ChatGPT Enterprise and countless partnerships with third parties. The New York Times is not seeking a specific sum in damages, but the lucrative potential of Gen AI is important to note.

Why? Because OpenAI defends itself by claiming that training AI models falls under “fair use”. Numerous agencies and authorities endorse this view, according to the company. Publishers also have an opt-out: blocking GPTBot, OpenAI’s scraping tool that collects training data from the web. ‘Fair use’ is a loose doctrine, though, usually invoked to protect criticism, research and reporting. For example, a review can quote passages or excerpts from a book, movie or other type of media to make a point. Such uses are transformative in nature, as they do not replace the act of experiencing the work in question.
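For context, OpenAI’s published guidance describes the opt-out as a robots.txt directive: a publisher adds a rule for the GPTBot user agent to its site’s robots.txt file. A minimal sketch of such a rule, blocking the crawler from the entire site, would look like this:

```
# robots.txt — disallow OpenAI's training-data crawler site-wide
User-agent: GPTBot
Disallow: /
```

Note that this only stops future crawling; it does nothing about content already ingested into existing models, which is precisely the gap the lawsuits target.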

That’s not how OpenAI deploys fair use in practical terms. The company implies that any publicly available work is appropriate for training purposes, simply because it’s publicly available. As the technology converts training data into new data, an AI application does not replace original work, the thinking goes. Whenever an original work does get copied wholesale, OpenAI considers that a bug.

It’s a very broad interpretation of fair use. Since OpenAI has already purchased licenses to The Associated Press archives and Axel Springer content, it is unlikely that even the company itself actually believes it. Rather, it seems to be making this argument out of legal considerations. Incidentally, the opt-out for publishers (“the right thing to do”) was not revealed until months after GPT-4 was introduced. The announcement was tucked away on an API subpage, so it’s not as if OpenAI was eager to share this capability with publishers.

Replacing the original work?

The New York Times is not the only party to have taken legal action against OpenAI. A class-action case (supported by comedian Sarah Silverman, among others) has been ongoing since July 2023, and a few days ago two authors also went to court, accusing OpenAI and Microsoft of copyright infringement in a separate case.

A key NYT argument is that OpenAI can use the unlicensed content to compete with it for news coverage. ChatGPT Plus does search for sources via Bing and extracts information from them, but its ability to translate this into meaningfully worded content depends heavily on the training data. In essence, it was taught how to do so in part by ingesting news articles from the past. By that logic, NYT would also have contributed high-quality background information to the AI search engine functionality of Microsoft Copilot (formerly Bing Chat).

A court decision could be hugely influential here. If OpenAI is forced to license content, it would likely lose a significant portion of its profits. After all, it is currently said to offer publishers only between $1 million and $5 million, in part because OpenAI believes it was entitled to use this content in the first place.

Bigger implications

OpenAI argues that generative AI is not viable without using copyrighted material. As the technology currently functions, that certainly seems to be the case. Indeed, all AI applications that rely on OpenAI technology rest on the training data collected by that company. There is no GPT model for which that is not the case.

It can be argued that Gen AI in its current form is therefore undesirable – it would be fair to call it a copyright quagmire as things stand. Having alleged IP theft approved after the fact is not a tenable model; it rewards companies for exploiting grey areas in the law for as long as possible, until regulations kick in to prevent the behaviour in question. In addition, the lack of transparency around training data is so complete that authors don’t even know for sure whether (and how much of) their content has been used. Thankfully, rules around AI transparency and copyright protection are already being developed. The EU is working on the AI Act, which is expected to prevent parties such as OpenAI, Microsoft and Google from continuing to ignore copyright concerns.

Should OpenAI be forced to license training data, the consequences will not be entirely positive. OpenAI’s technology is considered state of the art compared to the competition, whether that competition comes from Google or the open-source community. Curtailing that progress also means curtailing the transformative power of Gen AI. That is a choice that shouldn’t be decided unilaterally by a judge.

Above all, regulatory clarity is needed. Gen AI’s opaque datasets and potentially copyright-violating behaviour are not sustainable for commercial deployment. Companies need to know what information is collected for AI training and what models are basing their outputs on. Clear regulation from the EU, US and other parties can be decisive in this, ensuring the financial benefits of AI are shared by all parties involved. Then it’s simply a matter of deciding how much OpenAI should pay for each party’s content.