How are self-hosters here dealing with bad crawlers?

Reading this thread: Devs say AI crawlers dominate traffic, forcing blocks on entire countries | Hacker News

I wonder how self-hosting people deal in practice with crawlers that effectively DDoS their sites non-stop, especially on Fediverse instances.

2 Likes

I think a good first step is to quantify for yourself how big of an issue this is using the “new” pageview metric:

If you’re seeing something like 60% non-human traffic, that’s probably fine and you don’t need to take action.
If it’s 95%… yeah, it might be time to start investigating solutions.
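To put a rough number on it yourself, you can grep your access logs for self-identifying crawlers. A minimal sketch (the user agents below are illustrative samples; in practice you’d read them from your web server’s log file):

```python
import re

# Hypothetical sample of user agents pulled from an access log.
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/126.0",
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) Safari/604.1",
    "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)",
]

# Crude signal: most well-behaved crawlers identify themselves with
# "bot", "crawl", or "spider" somewhere in the user agent string.
BOT_RE = re.compile(r"bot|crawl|spider", re.IGNORECASE)

bots = sum(1 for ua in user_agents if BOT_RE.search(ua))
share = 100 * bots / len(user_agents)
print(f"{bots}/{len(user_agents)} requests look automated ({share:.0f}%)")
```

This undercounts, of course: the worst actors spoof browser user agents, so treat the result as a lower bound.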

The Blocked crawler user agents setting is the admin’s friend. The trash traffic isn’t such a big issue with Discourse because the load isn’t that heavy, but I’ve banned a handful of the worst ones because I really dislike their business model. Everyone is crying about how AI companies are stealing content, which they are actually doing, but SEO companies are much worse, and their bots are really greedy.
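If you’d rather stop them before they reach the app at all, the same idea works at the reverse proxy. A minimal nginx sketch (the bot names are examples; adjust the list to what shows up in your own logs):

```nginx
# Return 403 to a few aggressive crawler user agents.
map $http_user_agent $blocked_crawler {
    default        0;
    ~*SemrushBot   1;
    ~*AhrefsBot    1;
    ~*MJ12bot      1;
}

server {
    listen 80;
    server_name forum.example.com;  # example hostname

    if ($blocked_crawler) {
        return 403;
    }
    # ... rest of the site configuration ...
}
```

Blocking at the proxy is cheaper than blocking in the application, since the request never touches Ruby/PHP at all.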

But I’m using geo-blocking too, because I can. There are at least half a dozen countries that are consistent sources of door-knockers and other malicious actors. Of course, if a forum serves a global audience, that isn’t an option.
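For reference, geo-blocking in nginx can be sketched with the ngx_http_geoip2 module (requires a MaxMind country database; the database path and country codes below are placeholders, not a recommendation):

```nginx
geoip2 /var/lib/GeoIP/GeoLite2-Country.mmdb {
    $geoip2_country_code country iso_code;
}

map $geoip2_country_code $geo_blocked {
    default 0;
    XX      1;   # replace with the ISO country codes you want to block
    YY      1;
}

server {
    listen 80;
    server_name forum.example.com;  # example hostname

    if ($geo_blocked) {
        return 403;
    }
}
```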

With my WordPress sites, I do the same thing using Nginx, with the help of Varnish.

At the moment, the ratio of humans vs. bots is something like 50/50 in my forum.

1 Like

BTW, the tag isn’t right, I assume.

I agree; the AI tag has a plugin icon, so I assume it is meant for the AI plugin only. I’ve removed it.

Crawler content gets heavily cached, so in practice I’ve never seen crawlers able to DDoS a site.
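The same effect can be reproduced in front of any app with a short-lived microcache. A minimal nginx sketch (paths, zone name, and upstream port are examples):

```nginx
# Serve repeated crawler hits from a microcache so they never
# reach the application.
proxy_cache_path /var/cache/nginx/microcache keys_zone=microcache:10m
                 max_size=500m inactive=10m;

server {
    listen 80;
    server_name forum.example.com;  # example hostname

    location / {
        proxy_cache microcache;
        proxy_cache_valid 200 1m;          # cache OK responses for 1 minute
        proxy_cache_use_stale updating;    # serve stale while refreshing
        proxy_pass http://127.0.0.1:3000;  # example upstream app
    }
}
```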

Are you actually having performance issues because of this?

2 Likes