Sure! You’ll probably want to look at train-text-from-scratch in the llama.cpp project; it runs on pure CPU. The (admittedly sparse) docs should help, and otherwise ChatGPT is a good help if you show it the code. NanoGPT is fine too.
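If it helps, this is roughly the skeleton you end up with either way. It's a toy character-level model in PyTorch (a plain bigram table instead of a transformer, just to keep it tiny), and `corpus.txt` is a placeholder filename, so treat the whole thing as a sketch rather than what those projects actually ship:

```python
# Toy char-level language model training loop (nanoGPT-style, stripped down).
# Assumes PyTorch is installed and a plain-text file corpus.txt exists.
import torch
import torch.nn as nn
import torch.nn.functional as F

text = open("corpus.txt", encoding="utf-8").read()
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

class BigramLM(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly predicts logits for the next token
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx):
        return self.table(idx)

model = BigramLM(len(chars))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
block, batch = 64, 32

for step in range(1000):
    # sample random (input, next-char) training windows from the corpus
    ix = torch.randint(len(data) - block - 1, (batch,))
    x = torch.stack([data[i:i + block] for i in ix])
    y = torch.stack([data[i + 1:i + block + 1] for i in ix])
    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        print(step, loss.item())
```

A real run swaps the bigram table for a transformer block and trains a lot longer, but the data loading and loss loop look basically the same.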
For a dataset, maybe you could train on French Wikipedia, or scrape a French story site or fan fiction or whatever. Wikipedia is probably easiest, since they provide downloadable offline dumps that are only a couple of gigs.
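If you go the Wikipedia route, the dumps live under dumps.wikimedia.org, and something like this pulls the raw article text out. The URL pattern and the XML namespace string are from memory, so double-check them, and you'd still want something like wikiextractor or mwparserfromhell to strip the wiki markup afterwards:

```python
# Sketch: download the French Wikipedia dump and dump raw article text to a file.
# URL and namespace are assumptions; verify against dumps.wikimedia.org.
import bz2
import urllib.request
import xml.etree.ElementTree as ET

DUMP_URL = "https://dumps.wikimedia.org/frwiki/latest/frwiki-latest-pages-articles.xml.bz2"
urllib.request.urlretrieve(DUMP_URL, "frwiki.xml.bz2")  # a few GB, takes a while

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # schema version may differ

with bz2.open("frwiki.xml.bz2", "rb") as f, open("frwiki.txt", "w", encoding="utf-8") as out:
    for _, elem in ET.iterparse(f):
        if elem.tag == NS + "text" and elem.text:
            out.write(elem.text + "\n")
        elem.clear()  # keep memory usage flat on a multi-GB file
```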
A simple way would be to load the comments themselves and then check each one against your blocked users. But that would basically DDoS the instance servers and would be extremely janky lol
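Something like this is what I mean, just to show where the jank comes from. The endpoint path and response shape are made up / Lemmy-ish, not the real API:

```python
# Naive client-side filtering sketch: fetch everything, filter locally.
# Doing this (plus any per-author lookups) for every post and every page load
# is a lot of extra requests against the instance.
import requests

INSTANCE = "https://example-instance.tld"  # hypothetical instance

def fetch_comments(post_id: int) -> list[dict]:
    # hypothetical endpoint returning all comments for a post
    resp = requests.get(f"{INSTANCE}/api/v3/comment/list", params={"post_id": post_id})
    return resp.json()["comments"]

def visible_comments(post_id: int, blocked: set[str]) -> list[dict]:
    comments = fetch_comments(post_id)
    # drop anything written by a blocked user, entirely on the client side
    return [c for c in comments if c["creator"]["name"] not in blocked]
```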
The compression technology a diffusion model would have to achieve to realistically (i.e. not too lossily) store “the training data” would be more valuable than the entirety of the machine learning field right now.
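For a sense of scale, here's the back-of-envelope math. The checkpoint size, image count, and per-image size are all ballpark assumptions on my part, not exact figures:

```python
# Rough numbers: a Stable Diffusion-class checkpoint vs. a LAION-scale training set.
checkpoint_bytes = 4e9    # ~4 GB of weights (assumed)
num_images = 2e9          # ~2 billion training images (assumed)
avg_image_bytes = 100e3   # ~100 KB per already-JPEG-compressed image (assumed)

dataset_bytes = num_images * avg_image_bytes
print(f"dataset size: ~{dataset_bytes / 1e12:.0f} TB")                       # ~200 TB
print(f"bytes of weights per image: {checkpoint_bytes / num_images:.1f}")    # ~2 bytes
print(f"implied compression ratio: ~{dataset_bytes / checkpoint_bytes:,.0f}:1")  # ~50,000:1
```

Two bytes per image isn't storage, it's a statistical summary, which is the whole point.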
I dunno. Every time this happened to me, it just spat out some invalid link, or by sheer luck, a valid but completely unrelated one. This probably happens because it hits its context limit, only sees “poem”, and then tries to predict the token after “poem”, which apparently is some sort of closing note. What I’m trying to argue is that this is just sheer chance; there are only so many permutations of text.
but it's yummy
also expensive as fuck, wtf. The last time I had it, it was like 12 bucks. Never again.