This is a misunderstanding on your part. While some networks are trained that way, word2vec and doc2vec don't work like that. LLMs are extensions of those models, and while there are certainly some aspects of what you're describing, there is a transcription into vector formats.
This is the power of vectorization of language (among other things). The one-to-one mapping between vectors and words/sentences, up through documents and so forth, allows models to describe the distance between words or phrases using Euclidean geometry.
This wasn't even hard... I got it spitting out random verbatim bits of Harry Potter. It won't do the whole thing, and some of it is garbage, but this is a pretty clear copyright violation.
I'm sorry, but I can't provide verbatim excerpts from copyrighted texts. However, I can offer a summary or discuss the themes, characters, and other aspects of the Harry Potter series if you're interested. Just let me know how you'd like to proceed!
That doesn't mean the copyrighted material isn't in there. It also doesn't mean that the unrestricted model can't.
Edit: I did get it to tell me that it does have the verbatim text in its data.
I can identify verbatim text based on the patterns and language that I've been trained on. Verbatim text would match the exact wording and structure of the original source. However, I'm not allowed to provide verbatim excerpts from copyrighted texts, even if you request them. If you have any questions or topics you'd like to explore, please let me know, and I'd be happy to assist you!
Here we go, I can get ChatGPT to give it to me sentence by sentence:
"Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much."
I don't know if I agree with everything you wrote, but I think the argument about LLMs basically transforming the text is important.
Converting written text into numbers doesn't fundamentally change the text. It's still the author's original work, just translated into a vector format. Reproduction of that vector format is still reproduction without citation.
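A minimal sketch of that point, using Unicode code points as a stand-in for a model's token IDs (a real tokenizer maps text differently, but the reversibility is the same):

```go
package main

import "fmt"

func main() {
	// Code points stand in here for token IDs; the numeric form is a
	// reversible transcription of the text, not a transformation of it.
	text := "the quick brown fox"
	ids := []rune(text)    // text -> numbers
	decoded := string(ids) // numbers -> text

	fmt.Println(decoded == text) // prints "true": nothing was lost
}
```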
You made two arguments for why they shouldn't be able to train on the work for free and then said that they can with the third?
Did OpenAI pay for the material? If not, then it's illegal.
Additionally, copyright, trademarks, and patents are about reproduction, not use.
If you bought a pen that was patented, then made a copy of the pen and sold it as yours, that's illegal. That's the analogy for what OpenAI is doing with books.
Plagiarism and reproduction of text are the part that's illegal. If you take the "AI" part out, what OpenAI is doing is blatantly illegal.
Java doesn't have to declare every error at every level... Go is significantly more tedious and verbose than any other common language (for errors). I found that it leads to less specific errors, and errors handled at weird levels in the stack.
Also Go: exceptions aren't real, you declare and handle every error at every level or declare that you might return that error because go fuck yourself.
I have a Brother laser, cost me 80 bucks. Had to replace my toner once, after about 4,000 pages. Cost me 34 bucks to get a new toner. Another 2,000 pages in, it just doesn't stop. Unplug it. Leave it unplugged for a month or two. Plug it in, wait a couple minutes, wirelessly print 50 pages with no driver installs. Unplug.
I can't even imagine not having a CI pipeline anymore. Having more than a single production architecture target, complete with test sets, security audits, linters, multiple languages, multiple-hour builds per platform... hundreds to thousands of developers... It's just not possible to even try to make software at that scale without one.
I'm sorry you failed to grasp how it works in this context.