How to determine (and block) IP addresses of scrapers / bots through CloudFlare on Discourse?

I have a scenario where over 13,000 crawlers have been detected and database traffic has shot through the roof. Someone is clearly scraping our forum, but I’m not sure how to identify the IP addresses involved and take measures against them, since the traffic is going through CloudFlare DNS.

Any thoughts on how to do this?

I would like to set up an automatic rate limiting solution that detects aggressive network behavior and then either rate limits the offending clients or blocks them for a period of time.
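To make the idea concrete, here is a minimal sketch of the kind of logic I have in mind: a sliding-window counter per client IP that blocks an address for a while once it exceeds a threshold. This is not how Discourse or the ratelimited template works internally; the class name, the thresholds, and the assumption that the real client IP (not a CloudFlare address) is available are all my own for illustration.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical helper: blocks an IP for block_seconds once it makes
    more than max_requests requests within window_seconds."""

    def __init__(self, max_requests=100, window_seconds=10.0, block_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.block_seconds = block_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self._blocked_until = {}         # ip -> time at which the block expires

    def allow(self, ip, now):
        """Return True if the request at time `now` (seconds) may proceed."""
        # Still inside a block period: reject without counting the request.
        if self._blocked_until.get(ip, 0.0) > now:
            return False

        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()

        if len(hits) > self.max_requests:
            # Too aggressive: start a block period and reset the counter.
            self._blocked_until[ip] = now + self.block_seconds
            hits.clear()
            return False
        return True
```

For example, with `SlidingWindowLimiter(max_requests=3, window_seconds=10, block_seconds=60)`, the fourth request from one address within ten seconds is rejected and that address stays blocked for the next minute, while other addresses are unaffected.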

Do you have the cloudflare template in your yml config file?

Yes, this is in /var/discourse/containers/app.yml:

  - "templates/cloudflare.template.yml"

I should mention that I’m really hitting a wall here. Database access (I use a separate database server at the same site) is running at a constant 13 to 20 Mbps. It spiked when the crawler count went through the roof and hasn’t settled down for nearly two weeks. The server control panel shows traffic spiking to ridiculous levels at the same time and staying there ever since, and if this keeps up the servers are headed towards running out of bandwidth.

Do you have the rate limiting template included? Do you see the correct IP addresses in the logs (not the CloudFlare IPs)?

I do have this also in app.yml:

  - "templates/web.ratelimited.template.yml"

I do also see the correct IP addresses for users, though I’m not sure which logs you mean. Hmm.
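One way to check the logs question, sketched under some assumptions: take lines from the web server access log and see what share of the client addresses fall inside CloudFlare's ranges. If most of them do, the real client IPs are not being restored. The sketch assumes each log line starts with the client IP (as in the common nginx access log format), and the ranges listed are only a partial sample of CloudFlare's published IPv4 ranges that may be out of date, so the current list should be taken from CloudFlare itself. The function names are hypothetical.

```python
import ipaddress

# ASSUMPTION: partial, possibly outdated sample of CloudFlare's published
# IPv4 ranges. Replace with the current list before relying on the result.
CLOUDFLARE_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("173.245.48.0/20", "104.16.0.0/13", "172.64.0.0/13", "162.158.0.0/15")
]


def is_cloudflare(ip):
    """Return True if `ip` falls inside one of the sample ranges above."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CLOUDFLARE_RANGES)


def cloudflare_share(log_lines):
    """Fraction of parseable log lines whose first field is a CloudFlare IP.

    ASSUMPTION: the client address is the first whitespace-separated field.
    Lines that do not start with a valid IP address are skipped.
    """
    total = 0
    from_cloudflare = 0
    for line in log_lines:
        fields = line.split(None, 1)
        if not fields:
            continue
        try:
            matched = is_cloudflare(fields[0])
        except ValueError:
            continue  # first field is not an IP address
        total += 1
        from_cloudflare += matched
    return from_cloudflare / total if total else 0.0
```

A share close to 1.0 would suggest the logs are recording CloudFlare's proxy addresses, while a share close to 0.0 would suggest the real visitor addresses are coming through.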

As you may recall, I’ve considered stemming the bandwidth consumption for now by switching everything to internal private IP addresses, since that would at least stop me from getting some large bills.

I have noticed something else peculiar here. I changed the database connection away from the public IP to the private IP so as to not consume the monthly transfer allowance as quickly, but on the database server I expected to only see connections made via the private IP address from the Discourse docker server. I DO see traffic from the local private IP now, but I still see inordinate amounts of traffic coming from the public IP and thus still rapidly consuming the monthly allowance.

I’ve looked and looked for both the public IP address and the hostname of the database server on the Discourse Docker server, but I can’t find either anywhere. Even if I go into the app (./launcher enter app) and do an env | grep DB, I see the correct PRIVATE IP address for the LAN being used. Grepping through the filesystem doesn’t turn up the occurrences I’d expect either.

Any thoughts on how Discourse or the Docker image might still be accessing the wrong IP? I just cannot figure out where all of this public IP traffic is originating on the Discourse server.
