The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted Work

Millions of articles from The New York Times were used to train chatbots that now compete with it, the lawsuit said.
The existing industry that's popped up around LLMs has conveniently ignored that what these models are doing may have been illegal the whole time, and a lot of the experts knew it. This is why it's so important for folks to realize that the industry is not just thin wrappers around ChatGPT (and that interesting applications of this technology are largely being pushed out by the lowest hanging fruit). If this is ruled not to be fair use, then the whole industry will basically disappear overnight and we'll have to rebuild it from scratch, either with a new business model that pays authors or as open source/crowdsourced models (probably both). All that said, we're almost certainly better off. OpenAI may have kicked off the most recent "gold rush" but their methods have been terrible for both the industry at large and for further development of the tech.
It always should have had the right business model where they paid for this access for AI training. They knew it was wrong but in their rush to be known they decided it was better to take without asking and then ask for forgiveness later. Regardless of what happens now, people have already made a name for themselves swindling the likes of Microsoft out of it and will have long, well-paying careers from it.
It seems like it was almost necessary to go through this phase for the sake of developing the tech. Doesn't a lot of CS research use web crawling algorithms to gather data without checking whether the information is licensed for such use? What about the fediverse? It remains unclear what the copyright and licensing will be should it come into question. There is no EULA to access fedi, just a set of open protocols.
I seem to remember NYT suing Google years ago for effectively the same thing. Google copies all NYT articles into its index, then sells ads for people to search for that copyrighted information.
This is a fair point with regard to a handful of companies (Microsoft, Google, Meta) but there will still be an immediate loss in quality as they go back to basics on their data pipelines. Given how long they've spent playing catch-up in this space, I suspect progress will be pretty slow from there.
This is cute naiveté. This case will drag on for years and eventually be settled behind closed doors.
These models can still be trained on data that they're allowed to use, but I think that what we're seeing is that the better LLM services are probably trained with shocking amounts of private data, whereas the less performant probably don't use stolen data.
Textbooks are a big one that I suspect we'll probably see a set of suits over. Particularly because they seem to be some of the most valuable training data.
It certainly seems illegal, but if it is, then all search engines are too. They do the same thing. Search engines copy everything to their internal servers, index it, then sell access to that copyrighted data (via ads and other indirect revenue generators).
Definitely not. Search engines point you to the original and aren't by any means selling access. That is, the resources are accessible without using a search engine. LLMs are different because they do fold the inputs into the final product in a way that makes accessing the original material impossible. What's more, LLMs can fully reproduce copyrighted works and will try to pass them off as their own work.
If the companies can profit off stealing work and charging access for it, why can't I just pirate it myself without making anyone richer?
Why would the NYT pay the authors again?
I don't see why we would be better off if the NYT and other newspapers get a windfall profit. I don't see the reasoning here at all.
ETA: 6 downvotes so far. Would anyone mind explaining what the problem is? I'm not lying when I say that I don't see it.
The NYT as a company is much closer to its authors than AI is to its authors. When it exercises copyright, the owners of those copyrights are the NYT, but the authors are the ... Well, authors. You're right that a victory means newspapers get a lot more money.
... But would that be a bad thing? If newspapers become more profitable again, maybe we can see a resurgence of local papers and more reliable news. Instead of MSNBC and Fox and CNN, various papers could be our main media sources.
In any case -- there are times when business interests align with employee interests, and this is one of them. The NYT is effectively saying with this lawsuit that OpenAI et al. have been stealing from them, and by proxy, the authors. A victory in this court case would strengthen author rights and ownership. A loss would mean big corporations can take anything made by the public, use it for their AI, and then charge money for it. The training materials have a quantifiable value in what a trained model sells for versus an untrained model.
Pretty sure you were downvoted because it looks like you've misunderstood. The NYT do, in fact, pay their authors.