How are self-hosters here dealing with bad crawlers?

Reading this thread: Devs say AI crawlers dominate traffic, forcing blocks on entire countries | Hacker News

I wonder what it’s like for self-hosted people to deal with crawlers that are practically running a non-stop DDoS, especially on instances within the Fediverse.

2 Likes

I think a good first step is to quantify for yourself how big of an issue this is using the “new” pageview metric:

If you’re seeing something like 60% non-human traffic, that’s probably fine and you don’t need to take action.
If it’s 95%.. yeah, might be time to start investigating solutions.
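One rough way to get that number yourself, assuming a standard combined-format Nginx/Apache access log (the log path and the user-agent patterns here are just placeholders to adapt to your own setup):

```shell
# Rough bot-share estimate from a combined-format access log.
# With -F'"', field $6 is the quoted User-Agent string.
# The UA patterns are only a sample; extend them from what your logs show.
awk -F'"' '
  tolower($6) ~ /bot|crawler|spider|gptbot|ccbot/ { bots++ }
  { total++ }
  END { printf "%d of %d requests (%.0f%%) look like bots\n", bots, total, 100*bots/total }
' /var/log/nginx/access.log
```

This undercounts crawlers that fake a browser UA, but it’s a quick sanity check before reaching for heavier tooling.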

The Blocked crawler user agents setting is the admin’s friend. The trash traffic isn’t such a big issue with Discourse because the load isn’t that heavy, but I’ve banned a handful of the worst offenders because I really dislike their business model. Everyone is crying about how AI companies are stealing content, which they are indeed doing, but SEO companies are much worse, and their bots are really greedy.
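For reference, that Discourse site setting takes a pipe-delimited list of user-agent substrings. These are real SEO-crawler UA tokens, but the selection is only illustrative, not the poster’s actual list:

```text
mj12bot|ahrefsbot|semrushbot|dotbot|blexbot
```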

But I’m using geo-blocking too, because I can. There are at least half a dozen countries that are steady sources of door-knockers and other malicious actors. But if a forum is for a global audience, that isn’t possible, of course.
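If you have that option, Nginx can do the geo-blocking itself. A sketch assuming the third-party ngx_http_geoip2 module and a MaxMind GeoLite2 country database; the blocked-country list is only an example, not a recommendation:

```nginx
# Requires ngx_http_geoip2 and a GeoLite2-Country.mmdb from MaxMind.
geoip2 /etc/nginx/GeoLite2-Country.mmdb {
    $geoip2_country_code country iso_code;
}

# Flag requests from countries you have decided to block.
map $geoip2_country_code $geo_blocked {
    default 0;
    CN      1;  # example entries only
    RU      1;
}

server {
    # ... normal site config ...
    if ($geo_blocked) {
        return 403;
    }
}
```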

With my WordPress sites, I do the same thing using Nginx with the help of Varnish.
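On the Nginx side, a user-agent block is typically done with a `map`. A sketch only — the UA list and the 403 response are illustrative, not the poster’s actual config:

```nginx
# Map greedy crawler user agents to a flag; extend the list from your logs.
map $http_user_agent $bad_bot {
    default          0;
    ~*semrushbot     1;
    ~*ahrefsbot      1;
    ~*mj12bot        1;
}

server {
    listen 80;
    # ... normal site config ...
    if ($bad_bot) {
        return 403;  # or 444 to drop the connection without a response
    }
}
```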

At the moment, the ratio of humans vs. bots is something like 50/50 in my forum.

1 Like

BTW, the tag isn’t right, I assume.

I agree, the AI tag has a plugin icon so I assume it is meant for the AI plugin only. I’ve removed it.

Crawler content gets heavily cached, so in practice I’ve never seen them able to DDoS.

Are you actually having performance issues because of this?

5 Likes

I wish I could say I had a solution that was free, or didn’t involve an outside service. I put my biggest forum behind bunny.net’s CDN. They have a generous free tier, but for that forum I go ahead and pay the $10/month for their security service. It lets me block crawlers and DDoS attacks, and filter by geography. As CDNs go, they’re really cheap but effective, and they’re not Cloudflare. A lot of folks on the Fediverse rate them highly.

I have a graph from their Shield service. (I’m a n00b, only 1 graph per reply :slight_smile: ) In the first, there were 484K bot connections out of 2M connections overall; I had just moved to the CDN and didn’t have any filtering or blocking in place. The next shows 11K bots, plus 90K blocked by access lists (I block China and Russia and maybe a couple of others). So that’s about 100K from bots out of 700K total requests that week.

After:

2 Likes

I was, but I set up some rules to handle it.

Chandler Bing: 'Yeah, but I'm so much faster'

Cloudflare has always been good to me, and I’ve never had to pay for anti-bot services. That, plus their newer features like the anti-AI tools, is what keeps me a customer (and a shill for them, I guess). Don’t want AI scrapers stealing your data? Just use one of their managed rules (though it’s entirely possible with just a normal robots.txt, like I use on my site).
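For the robots.txt route, the major AI crawlers publish their user-agent tokens. A minimal file like this opts out of the big ones (GPTBot is OpenAI’s crawler, CCBot is Common Crawl’s, and Google-Extended is Google’s AI-training opt-out token), though compliance is entirely voluntary on the crawler’s part:

```text
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```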

…vs. a generic managed one; way better.

Whether or not these startups actually listen to and respect the file is another story, but good on them for trying at least. None of my sites have had issues with bots in the past, and I’m still consistently happy with the ability to block common WordPress exploits directly there after reading my logs.

1 Like

Facebook (Meta) did something like that: if I disable the “AI crawlers control”, Meta simply makes 9K requests per hour, so the only way is to block them all.

On the Fediverse I haven’t had this problem for a while, but I’m waiting for more ActivityPub updates, because even though I haven’t had any problems with it, my bandwidth would still be eaten for nothing.


Absolutely correct. I’m using a Lemmy server that runs behind CF, and their admin posted this tutorial:


Same here; my current rules are:

not (cf.client.bot and (lower(http.user_agent) contains "googlebot" or lower(http.user_agent) contains "bingbot")) and ip.src != IP_BYPASS

And based at that lemmy server above:

(starts_with(http.user_agent, "Mozilla/") and http.request.version in {"HTTP/1.0" "HTTP/1.1" "HTTP/1.2" "SPDY/3.1"} and any(http.request.headers["accept"][*] contains "text/html") and http.user_agent wildcard r"HeadlessChrome/*" and http.request.uri.path contains "/xmlrpc.php" and http.request.uri.path contains "/wp-config.php" and http.request.uri.path contains "/wlwmanifest.xml" and ip.src.asnum in {200373 198571 26496 31815 18450 398101 50673 7393 14061 205544 199610 21501 16125 51540 264649 39020 30083 35540 55293 36943 32244 6724 63949 7203 201924 30633 208046 36352 25264 32475 23033 31898 210920 211252 16276 23470 136907 12876 210558 132203 61317 212238 37963 13238 2639 20473 63018 395954 19437 207990 27411 53667 27176 396507 206575 20454 51167 60781 62240 398493 206092 63023 213230 26347 20738 45102 24940 57523 8100 8560 6939 14178 46606 197540 397630 9009 11878 49453 29802} and http.user_agent wildcard r"Mozilla/*" and not cf.client.bot and not ip.src in {BYPASS_IP_1 RANGE_IP.0/23 RANGE_IP_2/24}) or (ip.src.country in {"T1" "XX"}) or (http.request.version in {"HTTP/1.0" "SPDY/3.1" "HTTP/1.2"})

For me, that’s enough.

These rules helped me get through a DDoS (not sure if it really was one) last month.

This isn’t the place to really discuss the merits of Cloudflare, but my problem with them is not good people like you. My problem with them is all the bad people they’re perfectly willing to do business with. Anyone in the cybersecurity world who fights malware and botnets sees Cloudflare come up a lot. Likewise, anyone who fights extremists online knows how often Cloudflare will protect extremist sites where other providers won’t. It’s not that they’re ineffective or too expensive. It’s the lack of morals in selecting their clientele.

2 Likes