Revealed: The Authors Whose Pirated Books Are Powering Generative AI
Meh. Judging by how publishers work, if AI companies were forced to pay, it's guaranteed that almost all the money would go to publishers instead of authors.
Ding, ding, ding. We have a winner!
Big publishers, sure.
But my publisher is an ordinary human being. I have been to their house and met their family. They are not rich; in fact, they are struggling.
I have also released some work under Creative Commons, so for that stuff, I am my publisher.
But there are AIs out there that I don't want to be trained on my work. The abomination that Palantir is building to wage war on people springs to mind. I don't want anything to do with that.
In many jurisdictions we are allowed to assert moral rights over our work. This is the right for it to not be mutilated or perverted or attributed to someone else. In my view, moral rights should extend to assimilation by AI models.
I can't stop individual people who I think are bad people from reading it, sure, but surely I should get a say in whether massive corporations who I think are evil can use it for their own enrichment.
See, I thought this was well-known when Books3 was dumped online in 2020.
https://twitter.com/theshawwn/status/1320282149329784833
This is even referenced in the article.
I guess maybe people were shocked it was really "all of Bibliotik" because they couldn't believe someone could actually manage to keep a decent share ratio on that fucking site to not get kicked off, especially while managing to download the whole corpus. /s (I don't know this from personal experience or anything.)
In all seriousness, however, it's been well known for a while now that these models were being trained on copyrighted books, and the companies trying to hide their faces over it are a joke.
It's just like always: copyright is used to punish regular-ass people, but when corporations trash copyright, it's all "whoopsie doodles, can't you just give us a cost-of-doing-business fine and let us continue raping the public consciousness for a quick buck?" Corporations steal copyrighted material all the time, but regular-ass people don't have the money to fight it. Hiding behind fair use while using it to make a profit isn't just a joke but a travesty, and the ultimate in twisting language to corporate ends.
They may have bitten off more than they can chew here, though, possibly paving the way for a class-action lawsuit from writers and publishers.
Seems like a clearly transformative work that would be covered under fair use. As an aside, I've been using AI as a writing assistant/solitary roleplaying GM for several years now, and the quality of the prose can be quite good, but the authorship of stories is terrible, and I can't say they even do a good job of emulating a particular author's style.
Clearly transformative only applies to the work a human has put into the process. It isn't at all clear that an LLM would pass muster for a fair use defense, but there are court cases in progress that may try to answer that question. Ultimately, I think it's going to come down to whether the training process itself, and the human effort involved in training the model on copyrighted data, is considered transformative enough to be fair use, or doesn't constitute copying at all. As far as I know, none of the big cases are trying the "not a copy" defense, so we'll have to see how this all plays out.
In any event, copyright laws are horrifically behind the times and it's going to take new legislation sooner or later.
Seems like a clearly transformative work that would be covered under fair use.
People keep repeating this at me, but the thing is, I've seen what these things produce, and since the humans who created them can't even articulate what is going on inside the black box to produce output, it's hard for me to say "oh yeah, that human who can't describe what is even going on totally transformed the work." No, they used a tool to rip it up and shart it out, and they don't even seem to functionally know what goes on inside the tool. If you can't actually describe the process of how it happens, the human is not the one doing anything transformative; the program is, and the program isn't a human acting alone. It is a program made by humans with the intent to make money off of what the program can do. The program doesn't understand what it is transforming; it's just shitting out results. How is that "transformative"?
I mean, it's like fucking Superman 3 over here. "I didn't steal a ton from everyone, just fractions of pennies from every transaction! No one would notice, it's such a small amount." When the entire document produced is made of slivers of hundreds of thousands of copyrighted works, it doesn't strike me that any of it is original, nor justified in being called "fair use."
The one use I've found for using AI is getting it to prompt me. I'd found myself between stories, unable to settle on an idea, but I had a rough idea of the kind of thing I was looking for, mostly determined by going down human-made prompts and going "nope, nope, nope, that's crap, that's boring, that's idiotic, nope... FFS why isn't there anything with X, Y, and Z?"
So off I went to ChatGPT and said "give me a writing prompt with X, Y, and Z". What I got were some ideas that were okay, in that my response to them was more "okay, yeah, that's better than the output of r/WritingPrompts, but that plotline is still pretty derivative. Meh."
And then something happened a couple days later. Something clicked and an actual good idea came to me, one that I felt was actually worth developing further.
I absolutely would not want ChatGPT to do any writing for me. Not only would the end results be really derivative, but that's just not any fun. But there was definitely something useful in the process of asking it to echo X, Y, and Z at me so that I could refine my own ideas.
I also struggle to see how authors are actually harmed by this use, which might be problematic for them in court.
So funny. One can immediately see this article is biased from its use of the pejorative term "piracy". Let us please not use that term. It is either fair use or it is not, and which it is has not been proven either way.
I find it ironic that OP posted an archive URL.
That occurred to me, but IIRC, The Atlantic has a paywall without Firefox shenanigans.
I can't believe they nonchalantly resorted to piracy on such a massive scale for profit.
If a normal person did that, they would be locked in a cell for decades.
Well, lots of normal people do this not for profit, which is just as damning in the eyes of copyright.
But what if they had done this in a legitimate fashion? Say they got a library account and just ordered the books one by one, read them in, and then returned the books. As I understand it (which is not very well, tbh), LLMs don't keep a copy of the original reference. They use the works to determine paths and branches in what I assume is a quasi-statistical approach (i.e., Stable Diffusion associates characteristics with words, but once the characteristics are stored in the model, the original is effectively discarded and can't actually be recreated, except in the way a child might reproduce a picture from memory).
If the dataset is not, in fact, stored, would the authors still have a case?
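To make that "statistics, not copies" intuition concrete, here's a minimal sketch using a toy bigram model in Python. This is only an analogy: real LLMs learn continuous weights rather than explicit counts, but the principle that the model stores derived statistics rather than the text itself is similar.

    # Toy bigram "language model": it stores word-pair counts derived
    # from the text, not the text itself.
    import random
    from collections import defaultdict

    def train(corpus):
        """Count how often each word follows each other word."""
        counts = defaultdict(lambda: defaultdict(int))
        words = corpus.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
        return counts

    def generate(counts, start, length=10):
        """Sample a sequence by following the learned statistics."""
        out = [start]
        for _ in range(length):
            followers = counts.get(out[-1])
            if not followers:
                break
            choices = list(followers)
            weights = [followers[w] for w in choices]
            out.append(random.choices(choices, weights=weights)[0])
        return " ".join(out)

    # The model holds only pair frequencies; the source sentence can at
    # best be reproduced probabilistically, like a child recalling a
    # picture from memory.
    model = train("the cat sat on the mat and the cat slept on the mat")
    print(generate(model, "the"))

Of course, whether the learned weights of a real billion-parameter model can regurgitate training text verbatim is itself contested, so this only illustrates the framing of the defense, it doesn't settle it.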
I believe this should be allowed, honestly, because it's dangerous to disallow it. There are dictatorships training their AIs, and they won't care about copyrights. That's going to be an advantage for them, and the West should feed its models the same information.
We don't need to allow Stephen King, but scientific and engineering articles, sure.
OpenAI's gonna redo the training.
That said, it's concerning that dictatorships can feed more data to their AIs because they don't care about ethics. At some point their AIs might outperform western ones.
Here comes an unpopular opinion, but for the greater good we might eventually be forced to allow those companies to feed in everything.
Dictatorships (or any otherwise ideology-driven entities) will have their very own problems training AI. You cannot feed the AI material that goes against your own ideology, or it might not act in your best interest.