Controlling Web Crawlers For a Site

:bookmark: This guide explains how to manage web crawlers on your Discourse site.

:person_raising_hand: Required user level: Administrator

Web crawlers can significantly impact your site’s performance by increasing pageviews and server load.

When a site sees a spike in pageviews, it’s important to check how web crawlers fit into the mix.


Checking for crawler activity

To see if crawlers are affecting your site, navigate to the Consolidated Pageviews report (/admin/reports/consolidated_page_views) from your admin dashboard. This report breaks down pageview numbers from logged-in users, anonymous users, and crawlers.

A site where crawlers work normally:

A site where crawlers are out of control:

Identifying specific crawlers

Go to the Web Crawler User Agent report (/admin/reports/web_crawlers) to find a list of web crawler names sorted by pageview count.

When a problematic web crawler hits the site, its pageview count will be much higher than that of the other web crawlers. Note that several malicious web crawlers may be at work at the same time.

Blocking and limiting crawlers

It is good practice not to block the crawlers of the major search engines relevant to your audience, such as Google, Bing, DuckDuckGo, Yahoo, Baidu (China), Yandex (Russia), or Naver (Korea).

When a web crawler is out of control, there is a good chance the same crawler has hit other sites as well, and that someone else has already asked about it or written reports that will help you decide whether to limit or block that particular crawler.

Note that some crawlers may contribute a large number of pageviews if you use third-party services to monitor or add functionality to your site via scripts, etc.

For a list of known untrustworthy web crawlers, you can refer to https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/blob/master/robots.txt/robots.txt
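
Lists like that one are plain robots.txt rules: one User-agent line per unwanted bot, each followed by a blanket Disallow. A shortened, illustrative sketch (the bot names here are invented):

```text
User-agent: ExampleBadBot
Disallow: /

User-agent: AnotherScraperBot
Disallow: /
```

Keep in mind that only well-behaved crawlers honor robots.txt; genuinely malicious ones usually ignore it.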

Adjusting crawler settings

Under Admin > Settings there are settings that can help rate limit or block specific crawlers:

  • Slow down crawlers using:

    • slow down crawler user agents
    • slow down crawler rate
  • Block crawlers with:

    • blocked crawler user agents

Ensure you know the exact user agent name for the crawlers you wish to control. If you adjust any of the settings above and do not see a reduction in that agent’s pageviews, double-check that you are using the proper name.
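
For example, if the Web Crawler User Agent report showed a hypothetical crawler called ExampleBot dominating pageviews, and another called ScraperBot that you want gone entirely, the settings could look something like this (names and values are illustrative only; the rate is the delay, in seconds, applied between requests from matching crawlers):

```text
slow down crawler user agents: ExampleBot
slow down crawler rate: 60
blocked crawler user agents: ScraperBot
```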

When in doubt about how to act, always start with the “slow down” option rather than a full block. Check over time whether things improve; you can proceed to a full block if you do not notice appreciable results.



Should there be some kind of disclaimer that this only works with well-behaved crawlers? And even Google will bypass all of those when it comes via links from Gmail.

Both are enforced on the server.

However, if a bad bot pretends to be Chrome or something else by spoofing its headers, then we cannot use headers to detect it…


Killer fact: preview card requests count as pageviews!

The server I admin appears to have been swamped with preview card requests of the form http.rb/5.1.0 (Mastodon/4.0.2; +https://mstdn.science/)

I don’t think any action can be taken apart from telling mastodon posters to include an image so the preview card is not added automatically.


I already get over 1500 hits per day from crawlers. :tired_face: Can I block them all by using Cloudflare DNS? Or what option do I need to block them all outright? (Private instance)

I simply don’t want them.

Using e.g. nginx as a reverse proxy and stopping unwanted user agents there helps a lot. Blocking countries you don’t need helps quite a bit too.
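
Roughly something like this in the nginx config, assuming nginx sits in front of Discourse (the map block goes at the http level, and the agent names here are only examples — use the ones from your own logs):

```nginx
# Flag requests whose User-Agent matches any unwanted pattern (case-insensitive).
map $http_user_agent $blocked_agent {
    default          0;
    ~*mj12bot        1;
    ~*ahrefsbot      1;
    ~*examplescraper 1;
}

server {
    # ... existing listen / server_name / proxy settings ...

    # Reject flagged user agents before the request reaches Discourse.
    if ($blocked_agent) {
        return 403;
    }
}
```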

I can’t block the US, France, and Germany (big VPS countries), but for me blocking Russia, Vietnam, Iran, Iraq, etc. helped quite a lot.

But Discourse is quite… resilient, if that’s the right word. The situation is very different from WordPress, where those useless SEO bots, knockers, script kiddies, and malicious actors can easily bring a server to its knees.


I’m hosting at Hetzner Germany, with just two open ports in my firewall (80/443). And Discourse runs behind Nginx Proxy Manager (sure, there are better solutions, but I’m too lazy to code and I like web frontends).

Now I’m going the whitelist route, with a random string as the only allowed entry … from now on, no more page views :smiley:
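
That is, roughly, the allowed crawler user agents site setting (the allow-list counterpart of the block settings above) with a single entry that no real crawler will ever send — something like:

```text
allowed crawler user agents: not-a-real-crawler-string
```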

A question about exactly what to put in “slow down crawler user agents”.
For us Facebook is a major culprit, with Bing a close 3rd.
Report shows the following agents as the principal crawlers soaking up page views:

What exactly should be in “slow down crawler user agents” — these exact URLs including “https” or “http”? Or everything after the double-slash? Or something else? Or do we just go by trial and error?

Thanks!

To keep things simple, you should use the names of those bots. You can use any part of the user agent string, but be sure it doesn’t match more than you want.
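
For the two you mentioned, that would typically mean entries like these (shown pipe-separated here, the way Discourse stores list settings; in the admin UI you add them as separate items — double-check the exact strings in your Web Crawler User Agent report first):

```text
slow down crawler user agents: facebookexternalhit|bingbot
```

facebookexternalhit is the user agent Facebook uses when fetching link previews, and bingbot is Bing’s crawler.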

Slowing down bots is a very unreliable method, but some do follow that rule. But those requests come from your shares etc. and don’t create that much workload. WordPress would be another story.

But this is part of my blocked bots list. You get the point from it.


Thanks for this, @Jagster - very helpful. Feels like a game of whack-a-mole sometimes, but I get the idea of using part of the crawler name string rather than the whole thing.

A work in progress for me as site admin I guess - onwards!
