Is there a simple way to severely impede web scraping and LLM data collection of my website?
habitualTartare @habitualTartare@lemmy.world Posts 0 Comments 75 Joined 2 yr. ago
https://en.wikipedia.org/wiki/Robots.txt
This should cover any polite web crawlers, but compliance is voluntary.
https://platform.openai.com/docs/gptbot
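As a rough sketch of what that could look like: the robots.txt below tells OpenAI's GPTBot to stay out while leaving everyone else alone, and Python's standard `urllib.robotparser` shows how a polite crawler would read it. The `example.com` URL, the `SomeOtherBot` name, and the `is_allowed` helper are just placeholders for illustration, and none of this stops a crawler that ignores the file.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content: block GPTBot everywhere, allow all other agents.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL.

    Hypothetical helper for illustration; it only models well-behaved bots.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com and SomeOtherBot are placeholders, not real targets.
    print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/page"))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/page"))
```

You would save the text in `ROBOTS_TXT` as `/robots.txt` at the root of your site; the Python part is only there to check how a compliant parser interprets it.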
You might have to put the site behind a CAPTCHA or some other kind of challenge to severely limit automated access.
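A CAPTCHA usually comes from a third-party service, so I won't guess at one here. As one possible "other type" of gate, here is a minimal per-client rate limiter sketch. Everything in it (the `SlidingWindowLimiter` name, the limits, keying on a client ID such as an IP address) is my own assumption and not something from a particular framework. It slows bulk scraping down but won't stop a patient or distributed scraper.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds.

    Hypothetical illustration; a real site would likely do this in the
    web server or a reverse proxy instead.
    """

    def __init__(self, max_requests: int, window_seconds: float,
                 clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # client_id -> request timestamps

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        hits = self._hits[client_id]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject or show a challenge
        hits.append(now)
        return True
```

In a request handler you would call `limiter.allow(client_ip)` and return an HTTP 429 (or show the challenge page) when it comes back `False`.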
It's not realistic to assume it won't get scraped eventually, for example by someone paying people to bypass the CAPTCHA or by web crawlers that don't respect robots.txt. I also don't know whether Google and Microsoft bundle their AI data collection with their search crawling, so that you can't block one without also removing your site from web search.