How to determine (and block) IP addresses of scrapers / bots through CloudFlare on Discourse?

mreach · May 11, 2021, 1:56am

Over 13,000 crawlers have been detected on our forum and database traffic has shot through the roof. Someone is clearly scraping the forum, but I’m not sure how to determine the IP address and take measures against it, since the traffic is going through Cloudflare DNS.

Any thoughts on how to do this?

I would like to set up an automatic rate limiting solution that detects aggressive network behavior and then either rate limits the offending clients or blocks them for a period of time.
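
To make the idea of "detect aggressive clients, then block them for a while" concrete, here is a minimal Python sketch of a per-IP sliding-window limiter. It is not how Discourse or Cloudflare implement rate limiting; the class name, parameters, and thresholds are all hypothetical and only illustrate the technique.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical per-IP limiter: once an IP exceeds its request budget
    within a time window, refuse its requests for a fixed block period."""

    def __init__(self, max_requests, window_seconds, block_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.block_seconds = block_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self._blocked_until = {}         # ip -> time at which the block expires

    def allow(self, ip, now):
        """Return True if a request from `ip` at time `now` (seconds) may be served."""
        if self._blocked_until.get(ip, 0) > now:
            return False
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()
        hits.append(now)
        if len(hits) > self.max_requests:
            # Budget exceeded: start a block and reset the counter.
            self._blocked_until[ip] = now + self.block_seconds
            hits.clear()
            return False
        return True
```

With `SlidingWindowLimiter(max_requests=3, window_seconds=10, block_seconds=60)`, the fourth request from one address inside ten seconds is refused and that address stays blocked for the next sixty seconds, while other addresses are unaffected. This only works if the limiter sees the real client IP rather than the proxy's address.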

pfaffman · May 11, 2021, 2:37am

Do you have the Cloudflare template in your yml config file?

mreach · May 11, 2021, 5:07am

Yes, this is in /var/discourse/containers/app.yml:

  - "templates/cloudflare.template.yml"

mreach · May 11, 2021, 5:19am

I should mention that I’m really hitting a wall here. Traffic to the database (which runs on a separate server at the same site) has held at a constant 13 to 20 Mbps. It spiked when the crawler count went through the roof and hasn’t settled down for nearly two weeks. The server control panel shows traffic spiking to ridiculous levels at the same time and staying there, and if this keeps up the servers will run out of bandwidth.

pfaffman · May 12, 2021, 1:53am

Do you have the rate limiting template included? Do you see the correct IP addresses in the logs (not the Cloudflare IPs)?
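
One way to check this is to count requests per address in the web server's access log and see whether the busiest addresses fall inside the proxy's ranges. The Python sketch below is only an illustration: it assumes the first IPv4-looking token on each log line is the client address (which depends on the log format and ignores IPv6), and `top_client_ips` and its parameters are hypothetical names. The proxy ranges must be supplied by the caller, for example from Cloudflare's published IP list.

```python
import ipaddress
import re
from collections import Counter

# Matches the first dotted-quad token on a line (assumption: that is the client IP).
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def top_client_ips(log_lines, proxy_networks=(), limit=10):
    """Return (ip, hits, is_proxy) tuples for the busiest addresses in the log.

    `is_proxy` is True when the address lies inside one of `proxy_networks`,
    which suggests the log is recording the proxy rather than the visitor.
    """
    networks = [ipaddress.ip_network(n) for n in proxy_networks]
    counts = Counter()
    for line in log_lines:
        match = IPV4_RE.search(line)
        if not match:
            continue
        try:
            ip = ipaddress.ip_address(match.group(0))
        except ValueError:
            continue  # e.g. "999.1.1.1" looks like an address but is not one
        counts[ip] += 1
    return [
        (str(ip), hits, any(ip in net for net in networks))
        for ip, hits in counts.most_common(limit)
    ]
```

If most of the top entries come back with `is_proxy` set to True, the server is logging proxy addresses and any per-IP rate limiting will act on the proxy instead of the scraper. If they are ordinary addresses with very high hit counts, those are the candidates to block.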

mreach · May 12, 2021, 6:00am

I do have this also in app.yml:

  - "templates/web.ratelimited.template.yml"

I also do see the correct IP addresses for users - I’m not sure what you mean by the logs. Hmm.

mreach · May 13, 2021, 6:04am

As you may recall, I’ve been thinking of stemming the bandwidth consumption for now by switching everything to internal private IP addresses, since that will at least stop me from getting some large bills.

I have noticed something else peculiar. I changed the database connection from the public IP to the private IP so that it would not eat into the monthly transfer allowance as quickly, and I expected the database server to then see connections from the Discourse Docker server only via the private IP address. I DO see traffic from the local private IP now, but I still see inordinate amounts of traffic coming from the public IP, which is still rapidly consuming the monthly allowance.

I’ve looked and looked for both the public IP address and the hostname of the database server on the Discourse Docker server, but I can’t find either anywhere. Even if I enter the app container (./launcher enter app) and run env | grep DB, I see the correct PRIVATE IP address for the LAN. I can grep through the filesystem and I just don’t see the occurrences that I’d expect.

Any thoughts on how Discourse or the Docker image might still be accessing the wrong IP? I just cannot figure out where all of this public IP traffic is originating on the Discourse server.
