Academic Torrents has Reddit data up to December 2023. This data isn't live-updated; my understanding is that it's scraped shortly after it's first posted. That's how services like removeddit worked: they would show the "original" version of a post or comment from when it was scraped rather than the edited or deleted version that Reddit shows now.
The age isn't really the most important thing when it comes to training a base AI model. If you want to teach it about current events there are better ways to do that than social media scrapes. Stuff like Reddit is good for teaching an AI about how people talk to each other.
> I use quotation marks there because what is often referred to as AI today is not whatsoever what the term once described.
The field of AI has been around for decades and covers a wide range of technologies, many of them much "simpler" than the current crop of generative AI. What is often referred to as AI today is absolutely what the term once described, and still does describe.
What people seem to be conflating is the general term "AI" and the more specific "AGI", or Artificial General Intelligence. AGI is the stuff you see on Star Trek. Nobody is claiming that current LLMs are AGI, though they may be a significant step along the way to that.
I may be sounding nitpicky here, but this is the fundamental issue that the article is complaining about. People are not well educated about what AI actually is and what it's good at. It's good at a huge range of things, and it's genuinely revolutionary, but it's not good at everything. It's not the fault of AI when people fail to grasp that, any more than it's the fault of the car when someone gets into it and then is annoyed it won't take them to the Moon.
Frankly, these NATO expansions and its general re-invigoration are a larger loss for Russia than anything they could possibly gain in Ukraine. Their Baltic fleet is now useless. Kaliningrad is useless.
Combined with all the other damage Ukraine has inflicted on Russia, they're basically spiralling down the drain, and I see no possible way Russia could rise in prominence in the future. Even if, goodness forbid, they were to "win" the current war they're fighting with Ukraine, that won't help them, it'll only hurt Ukraine.
Which is why nobody trains on ONLY AI-generated data.
Really, experts have thought of this stuff already. Because they're experts. Synthetic data means that the amount of "real" data required is much less, so giant repositories like Reddit aren't so important.
Regardless, the content that's available through PushShift is the content that people are talking about overwriting or deleting. They can't edit or delete stuff that PushShift couldn't see in the first place.
> In case you didn’t know, you can’t train an AI on content generated by another AI because it causes distortion that reduces the quality of the output.
This is incorrect in the general case. You can run into problems if you do it incorrectly or in a naive manner. But this is stuff that the professionals have figured out months or years ago already. A lot of the better AIs these days are trained on "synthetic data", which is data that's been generated by other AIs.
I've seen a lot of people fall for wishful thinking on this subject. They don't like AI for whatever reason, so when they hear some news article saying something like "AI won't work because of problem X", they grab hold of that. "Model collapse" is one of those things; it's not really a problem that serious researchers consider insurmountable.
If you don't want Reddit to use your posts to train AI then don't post on Reddit. If you already did post on Reddit, it's too late, you already gave them your content. Bear this in mind next time you join a social media site, I guess.
Not to mention that a response "containing" plagiarism is a pretty poorly defined criterion. The system being used here is proprietary so we don't even know how it works.
I went and looked at how low the scores for theater and such were, and it's dramatic:
> The lowest similarity scores appeared in theater (0.9%), humanities (2.8%) and English language (5.4%).
That's why I was suggesting such a simple approach, it doesn't require AI or machine learning except in the most basic sense. If you want to try applying fancier stuff you could use those basic word-based filters as a first pass to reduce the cost.
Somewhere in between, sure. But don't interpret that to mean that the most likely real number is exactly in the middle. I consider Russian numbers to be way less credible.
Another more general property that might be worth looking for would be substantially similar posts that get cross-posted to a wide variety of communities in a short period of time. That's a pattern that can have legitimate reasons but it's probably worth raising a flag to draw extra scrutiny.
One idea for making it computationally lightweight but also robust against bots "tweaking" the wording of each post might be to fingerprint each post based on rare word usage. Spam is likely to mention the brand name of whatever product it's hawking, which is probably not going to be a commonly used word. So if a bunch of posts come along that all use the same rare words all at once, that's suspicious. I could also easily see situations where this gives false positives, of course - if some product suddenly does something newsworthy you could see a spew of legitimate posts about it in a variety of communities. But no automated spam checker is perfect.
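The rare-word fingerprinting idea above could be sketched roughly like this. This is a minimal illustration, not a production spam filter: the class name, the common-word list, the comparison window, and the overlap threshold are all made up for the example (in practice you'd build the common-word set from corpus frequencies and tune the numbers against real traffic).

```python
import re
from collections import deque

def tokenize(text):
    """Lowercase a post and pull out word-like tokens."""
    return re.findall(r"[a-z']+", text.lower())

class RareWordFlagger:
    """Flag posts that share several rare words with recent posts.

    Hypothetical sketch of the scheme described above: fingerprint each
    post by its uncommon words, then flag a post if its fingerprint
    overlaps heavily with any recently seen post's fingerprint.
    """

    def __init__(self, common_words, window=50, threshold=3):
        self.common = common_words        # words too frequent to be informative
        self.threshold = threshold        # shared rare words needed to trip a flag
        self.recent = deque(maxlen=window)  # fingerprints of the last N posts

    def fingerprint(self, text):
        # The fingerprint is just the set of words NOT in the common list.
        return frozenset(w for w in tokenize(text) if w not in self.common)

    def check(self, text):
        """Return True if this post looks like part of a repeated burst."""
        fp = self.fingerprint(text)
        flagged = any(len(fp & prev) >= self.threshold for prev in self.recent)
        self.recent.append(fp)
        return flagged
```

Usage would be a matter of feeding each incoming post to `check()` and routing flagged ones to human review; the brand name of a spammed product ends up in every fingerprint, so a burst of "tweaked" copies still collides even when the surrounding wording changes.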
I've got a bunch of frozen mashed potatoes pre-divided into meal-sized tupperware. Microwave one of those and it's quite hearty.
I've also got a rice cooker and it's super easy to make something both substantial and tasty with one of those, dump in the rice and water and then add a tin of condensed soup as well. Push the button and come back later to dump it onto a plate. I've found most kinds of condensed soup work well, though avoid anything with "cream of" in the title as those can end up unpleasantly goopy.
No, that's not the concern here. He's getting job offers from new employers while he's midway through this personal project, and he wants to make sure the new employers don't have anything in their employment contracts that would end up grabbing it.
The old employers trying to claim it was also a concern, but that wasn't what OP was concerned about so I didn't mention it. He had a lawyer check over his old employment contract as well to make sure there wasn't a problem there. As long as he's not using proprietary tech retained from the old job (and he's not) there's no problem there.
> Anything you do in your own time is generally unenforceable
With the important caveat that your employment contract may include clauses that give them rights over that stuff anyway, and even if they're unenforceable you could still end up having to fight in court over it.
Definitely something to keep in mind when reading the contract over, and ideally get a lawyer to take a look. It can be expensive, but weigh that expense against the potential expense of what would happen if you get screwed over.