I think a good first step is to quantify for yourself how big an issue this is, using the “new” pageview metric:
If you’re seeing something like 60% non-human traffic, that’s probably fine and you don’t need to take action.
If it’s 95%… yeah, it might be time to start investigating solutions.
The Blocked crawler user agents setting is the admin’s friend. The trash traffic isn’t such a big issue with Discourse because the load isn’t that heavy, but I’ve banned a handful of the worst ones because I really dislike their business model. Everyone is crying about how AI companies are stealing content, which they are, but SEO companies are much worse — and their bots are really greedy.
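For a concrete starting point, here’s the kind of thing you might paste into that setting. These particular user-agent substrings are just examples of well-known SEO crawlers — check your own logs to see which ones are actually hammering you:

```text
# Discourse → Admin → Settings → "blocked crawler user agents"
# Comma-separated user-agent substrings (examples only, not an official list)
AhrefsBot, SemrushBot, MJ12bot, DotBot
```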
But I’m using geo-blocking too, because I can. There are at least half a dozen countries that are the main sources of door-knockers and other malicious actors. But if a forum is for a global audience, that isn’t possible, of course.
With my WordPress sites, I do the same thing using Nginx with the help of Varnish.
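A rough sketch of how that can look in Nginx. The user-agent patterns and country codes here are examples only, and the geo part assumes the third-party ngx_http_geoip2 module with a MaxMind GeoLite2 country database installed; the server name and upstream port are hypothetical:

```nginx
# Flag greedy crawlers by user agent (example patterns only)
map $http_user_agent $bad_bot {
    default                           0;
    ~*(AhrefsBot|SemrushBot|MJ12bot)  1;
}

# Assumes the geoip2 module and a GeoLite2-Country database
geoip2 /etc/nginx/GeoLite2-Country.mmdb {
    $geoip2_country_code country iso_code;
}

map $geoip2_country_code $blocked_country {
    default 0;
    CN      1;   # example country codes only
    RU      1;
}

server {
    listen 80;
    server_name forum.example.com;           # hypothetical name

    if ($bad_bot)         { return 403; }
    if ($blocked_country) { return 403; }

    location / {
        proxy_pass http://127.0.0.1:6081;    # Varnish in front of WordPress
    }
}
```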
At the moment, the ratio of humans vs. bots is something like 50/50 in my forum.
I wish I could say I had some solution that was free, or didn’t involve an outside service. I put my biggest forum behind bunny.net’s CDN. They have a generous free tier, but for that forum I go ahead and pay the $10/month for their security service. It lets me block crawlers, DDoS attacks, and traffic by geography. As CDNs go, they’re really cheap but effective, and they’re not Cloudflare. A lot of folks on the fediverse rate them highly.
I’ve got a graph from their Shield service. (I’m a n00b, only 1 graph per reply.) In the first, there were 484K bot connections out of 2M connections overall; I had just moved to the CDN and didn’t have any filtering or blocking in place. The next shows 11K bots and 90K requests blocked by access lists (I block China and Russia and maybe a couple of others). So that’s about 100K bot-related requests out of 700K total that week.
Cloudflare has always been nice to me and I’ve never had to pay for anti-bot services. That, plus their newer stuff like the anti-AI features, is what keeps me a customer and a shill for them, I guess. Don’t want AI scrapers stealing your data? Just use one of their managed rules (though it’s entirely possible with just a normal robots.txt, like I do on my site).
Whether or not these startups actually listen to and respect the file is another story, but good on them for trying at least. None of my sites have had issues with bots in the past, and I’m still repeatedly happy with being able to block common WordPress exploits directly there after reading my logs.
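For reference, the robots.txt route looks like this. These are user agents the companies themselves document (GPTBot is OpenAI’s crawler, ClaudeBot is Anthropic’s, CCBot is Common Crawl’s, and Google-Extended is Google’s opt-out token for AI training) — but as said, compliance is entirely voluntary on their end:

```text
# robots.txt — opt out of some known AI crawlers (voluntary compliance)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```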
Facebook (Meta) does something like that: if I disable the ‘AI crawlers control’, Meta simply makes 9K requests per hour, so the only way is to block them all.
On the fediverse I haven’t had these problems for a while, but I’m waiting for more updates to ActivityPub, because even though I haven’t had any problems with it, my bandwidth would be getting eaten for nothing.
This isn’t the place to really discuss the merits of CloudFlare, but my problem with them is not good people like you. My problem with them is all the bad people they’re perfectly willing to do business with. Anyone in the cybersecurity world who fights malware and botnets sees CloudFlare come up a lot. Likewise anyone who fights extremists online knows how often CloudFlare will protect extremist sites where other providers won’t. It’s not that they’re ineffective or that they’re too expensive. It’s the lack of morals in selecting their clientele.