Generative AI will eventually poison itself

Generative AI's "model collapse" will cause it to poison itself, here's what that means
This article is grossly overstating the findings of the paper. It's true that bad generated data hurts model performance, but that's true of bad human data as well. The paper used OPT-125M as its generator model, a very small research model with fairly low-quality and often incoherent outputs. The higher-quality generated data that makes up the majority of generated text online is far less of an issue. Using generated data to improve output consistency is common practice for both text and image models.
Tbh I think you're making a lot of assumptions and ignoring the point of this paper. The small model was used to quickly show proof of generative degradation over iterations when the model was trained on its own output data. OPT-125M was chosen precisely for its small size so they could demonstrate the phenomenon in fewer iterations. The point still stands that this shows data poisoning exists, and a model being much bigger doesn't mean it's immune to this effect, just that it will take longer. I suspect that with companies continually scraping the web and other sources for data, like Reddit, which this article mentions has struck a deal with Google to let its models train on it, this process will not in fact take too long, as more and more Reddit posts become AI-generated themselves.
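A quick toy sketch of that feedback loop (my own illustration in Python, not the paper's actual setup): model each "generation" as fitting a Gaussian to samples drawn from the previous generation's fit. Finite-sample estimation keeps clipping the tails, so the spread collapses over iterations regardless of where you start.

```python
# Toy illustration (hypothetical setup, not the paper's experiment):
# recursive "train on your own output" modeled as repeatedly fitting a
# Gaussian to samples drawn from the previous fit. Each refit loses a
# bit of tail mass, so the distribution narrows generation by generation.
import numpy as np

def recursive_fit(n_samples=10, n_generations=100, seed=0):
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0                 # "real data" distribution
    history = [sigma]
    for _ in range(n_generations):
        data = rng.normal(mu, sigma, n_samples)  # generate from current model
        mu, sigma = data.mean(), data.std()      # refit on generated data only
        history.append(sigma)
    return history

hist = recursive_fit()
print(f"std: {hist[0]:.3f} -> {hist[-1]:.3g}")   # spread shrinks toward 0
```

Raising `n_samples` slows the collapse down but doesn't remove the downward drift, which matches the "bigger just takes longer" point.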
I think it's a fallacy to assume that a giant model is therefore "higher quality" and resistant to data poisoning
Is it being poisoned because the generated data is garbage or because the generated data is made by an AI?
Using a small model lets it be shown faster, but also means the outputs are seriously terrible. It's common to fine-tune models on GPT-4 outputs, which directly goes against this.
And there is a correlation between size and performance. It's not a rule per se, and people are working hard on squeezing more and more out of small models, but it's not a fallacy to assume bigger is better.
I've been calling this for a while now.
I've been calling it the Ouroboros effect.
There are even bigger factors at play that the paper didn't even dig into, and that's selection bias due to human intervention.
Say at first an AI has 100 unique outputs for a given prompt.
However, humans will favor, say, half of them: people naturally regenerate a few times and pick their preferred "cream of the crop" result.
This will then ouroboros for an iteration.
Now the next iteration only has say 50 unique responses, as half of them have been ouroboros'd away by humans picking the one they like more.
Repeat, each time "half-lifing" the originality.
Over time, everything will get more and more samey. Models will degrade on originality as everything muddles into corporate speak.
You know how every corporate website uses the same useless "doesn't mean anything" jargon string of words, to say a lot without actually saying anything?
That's the local minimum AI is headed toward too, as it keeps getting selectively "bred" to speak in an appealing but nonspecific way for the majority of online content.
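The halving described above is easy to sketch (hypothetical numbers, a pool of 100 unique outputs and a 50% keep rate, as in the comment): each generation, users keep only their favorite half, and the next model effectively only sees the survivors.

```python
# Rough sketch of the selection argument (hypothetical numbers):
# each generation, humans keep roughly half the distinct outputs,
# and the next model trains only on those survivors, so the count
# of distinct outputs halves geometrically.
import random

def selection_rounds(pool_size=100, keep_fraction=0.5, rounds=5, seed=1):
    random.seed(seed)
    pool = list(range(pool_size))      # 100 unique outputs, by id
    counts = [len(pool)]
    for _ in range(rounds):
        keep = max(1, int(len(pool) * keep_fraction))
        pool = random.sample(pool, keep)   # humans keep their favorites
        counts.append(len(pool))
    return counts

print(selection_rounds())  # [100, 50, 25, 12, 6, 3]
```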
I mean, that's kind of how chatgpt is now. I've been slowly getting into llama; can't quite get it as good as gpt yet, but I'm learning.
No, it won't.
A number of things:
It only leads to collapse if all the organic data representing the long tails of data variety disappears. Which hopefully throws water on x-risk doomers, as AI killing humanity would currently be a murder-suicide.
Are they talking about inbreeding, where most online content is AI-generated and AIs start training on other AIs' data?
good riddance i say
This was inevitable with how much money and speed went into getting these up and running. As they get more widespread they will just poison each other and overfit like crazy. Hopefully it will happen sooner rather than later.
i miss when gpt was kept unpublished because it was “too dangerous”. i wish we could have released it in a more mature way.
because we were right. we couldn’t be trusted and immediately ruined the biggest wonder of humanity by having it generate thousands to millions of articles for a quick buck. toothpaste is out of the tube now and it can never go back in.
Someone would have made one eventually. Unless the government monitors every computer in existence, AI is inevitable.
it’s not the “making one” that’s a problem. it’s the making, optimizing and rabid marketing of one in the service of capital instead of humans.
if only a bunch of open source, true non-profits released language models, the landscape might still suck but would be distinctly less toxic.
and if the government (or even a decently sized ngo standards entity) had worked proactively with computer scientists to find solutions like watermarking, labor replacement protections, and copyright protections, things might be arguably perfect. not one of those things happened and so further into the hellscape we descend.
And just to make it clear, we should not give the government the ability to monitor every computer in existence, or even any computer not owned by them.
I am quite pleased the AI decided to take it to heart when I told it to kill itself
Well, it’s also killing the internet in the process. It’s like a tumor on the internet.
Nah. It's degrading the internet, for sure, but not killing it. We had a similar event in September 1993 (the Eternal September) and the internet survived fine.
It won't die, it will just plateau. At least for now.