Controlling Web Crawlers For a Site

Discourse · November 9, 2022, 12:04pm

This guide explains how to manage web crawlers on your Discourse site.

Required user level: Administrator

Web crawlers can significantly impact your site’s performance by increasing pageviews and server load.

When a site sees a spike in pageviews, it is important to check how much of that traffic comes from web crawlers.

Checking for crawler activity

To see if crawlers are affecting your site, navigate to the Site Traffic report (/admin/reports/site_traffic) from your admin dashboard. This report breaks down pageview numbers from logged-in browser users, anonymous browser users, crawlers, and other sources.

A site where crawlers work normally:

A site where crawlers are out of control:

Identifying specific crawlers

Go to the Web Crawler User Agent report (/admin/reports/web_crawlers) to find a list of web crawler names sorted by pageview count.

When a problematic web crawler hits the site, the number of its pageviews will be much higher than the other web crawlers. Note that there may be a number of malicious web crawlers at work at the same time.
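As a rough illustration of "much higher than the other web crawlers", the following sketch flags crawlers whose pageview count exceeds a multiple of the median. The function name, the factor of 10, and the sample numbers are all assumptions made for this example; they are not part of Discourse or its reports.

```python
from statistics import median


def flag_outlier_crawlers(pageviews, factor=10.0):
    """Return crawler names whose pageviews exceed `factor` times the median.

    `pageviews` maps a crawler name to its pageview count, e.g. numbers
    copied by hand from the Web Crawler User Agent report.
    """
    if not pageviews:
        return []
    typical = median(pageviews.values())
    # Guard against a median of zero so the threshold is never zero.
    threshold = factor * max(typical, 1)
    flagged = [name for name, count in pageviews.items() if count > threshold]
    # Highest pageview count first, like the report itself.
    return sorted(flagged, key=lambda name: pageviews[name], reverse=True)


# Made-up numbers for illustration only.
sample = {"googlebot": 1200, "bingbot": 900, "examplebot": 48000, "yandexbot": 300}
print(flag_outlier_crawlers(sample))
```

The median is used rather than the mean so that several misbehaving crawlers active at the same time do not hide each other by inflating the baseline.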

Blocking and limiting crawlers

It is good practice not to block the crawlers of the major search engines, such as Google, Bing, Baidu (Chinese), Yandex (Russian), Naver (Korean), DuckDuckGo, Yahoo, and others relevant to your country.

When a web crawler is out of control, there is a good chance it has hit other sites as well. Someone may already have asked about it or published a report, and that information can help you decide whether to limit or block that particular crawler.

Note that some crawlers may contribute a large number of pageviews if you use third-party services to monitor or add functionality to your site via scripts, etc.

To obtain a record of untrustworthy web crawlers, you may refer to this list, https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/blob/master/robots.txt/robots.txt

Adjusting crawler settings

Under Admin > Settings there are some settings that can help rate limit specific crawlers:

Slow down crawlers using:
- slow down crawler user agents — by default this includes gptbot, claudebot, anthropic-ai, and brightbot
- slow down crawler rate — the number of seconds between allowed requests per crawler (default: 60)
Block crawlers with:
- blocked crawler user agents — by default this includes mauibot, semrushbot, ahrefsbot, blexbot, and seo spider
Allow only specific crawlers with:
- allowed crawler user agents — when set, only the listed crawlers will be allowed to access the site; all others will be blocked. This acts as a strict allowlist. Warning: setting this will override blocked crawler user agents and block all crawlers not on the list, including major search engines if they are not included.
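The way these three lists interact can be pictured with a small sketch. This is not Discourse's implementation. It assumes the request has already been identified as a crawler, that entries match as case-insensitive substrings of the user agent, and that the slow-down list still applies to crawlers that pass the allowlist. The function name and return values are made up for the example.

```python
def crawler_action(user_agent, allowed=(), blocked=(), slowed=()):
    """Return "allow", "slow", or "block" for a crawler's user agent string."""
    ua = user_agent.lower()

    def matches(tokens):
        # Assumption: an entry matches if it appears anywhere in the user agent.
        return any(token.lower() in ua for token in tokens)

    if allowed:
        # A non-empty allowlist overrides the blocklist entirely:
        # anything not listed is blocked, anything listed gets through.
        if not matches(allowed):
            return "block"
    elif matches(blocked):
        return "block"

    if matches(slowed):
        return "slow"
    return "allow"


print(crawler_action("Mozilla/5.0 (compatible; Googlebot/2.1)", allowed=["bingbot"]))
```

The example call shows the risk described in the warning above: with only bingbot on the allowlist, Googlebot is blocked even though it appears on no blocklist.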

Ensure you know the accurate user agent name for the crawlers you wish to control. If you adjust any of the settings above and do not see a reduction in pageviews of that agent, you may want to double check that you are using the proper name.
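Before adding a name to one of these settings, it can help to check which user agent strings it would catch. The sketch below assumes case-insensitive substring matching, which may not be exactly how Discourse compares entries. The helper name is hypothetical, and the sample strings are illustrative examples of the kind of user agents that appear in the Web Crawler User Agent report.

```python
def agents_matching(token, user_agents):
    """Return the user agent strings that contain `token`, ignoring case."""
    needle = token.lower()
    return [ua for ua in user_agents if needle in ua.lower()]


# Illustrative user agent strings; replace them with the ones from your own report.
samples = [
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)",
]

for token in ("bingbot", "bot"):
    print(token, "matches", len(agents_matching(token, samples)), "of", len(samples))
```

A short entry such as "bot" catches more than one crawler here, which is why a specific name is usually the safer choice.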

When in doubt about how to act, always start with the “slow down” option rather than a full block. Check over time if there are improvements. You can proceed with a full block if you do not notice appreciable results.
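To make the "slow down" option concrete, here is a minimal sketch of a per-crawler minimum interval, modeled on the description of slow down crawler rate (seconds between allowed requests per crawler). It is an illustration under that assumption, not Discourse's actual rate limiter. The class and method names are invented for the example, and the caller passes in the current time so the behavior is easy to follow.

```python
class CrawlerSlowdown:
    """Allow at most one request per crawler every `min_interval_seconds`."""

    def __init__(self, min_interval_seconds=60.0):
        self.min_interval = min_interval_seconds
        self._last_allowed = {}  # crawler name -> time of last allowed request

    def allow(self, crawler_name, now):
        """Return True if the request may proceed at time `now` (in seconds)."""
        last = self._last_allowed.get(crawler_name)
        if last is not None and now - last < self.min_interval:
            return False  # too soon since this crawler's last allowed request
        self._last_allowed[crawler_name] = now
        return True


limiter = CrawlerSlowdown(min_interval_seconds=60.0)
print(limiter.allow("gptbot", now=0.0))     # first request from this crawler
print(limiter.allow("gptbot", now=30.0))    # only 30 seconds later
print(limiter.allow("claudebot", now=30.0)) # a different crawler, tracked separately
```

In a real service the `now` argument would typically come from something like `time.monotonic()`. Each crawler is tracked separately, so slowing one does not affect the others.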

Last edited by @SaraDev 2024-09-11T19:32:59Z

Jagster · November 9, 2022, 12:49pm

Should there be some kind of disclaimer that this works only with well-behaved ones? And even Google will bypass all of those when it comes via links from Gmail.

sam · November 10, 2022, 12:55am

Both are enforced on the server.

However, if a bad bot pretends to be Chrome or someone else by spoofing headers then we can not use headers to detect it…

spdegabrielle · July 11, 2023, 8:37am

Killer fact: preview cards count as a page view!

The server I admin appears to have been swamped with preview card requests of the form http.rb/5.1.0 (Mastodon/4.0.2; +https://mstdn.science/)

I don’t think any action can be taken apart from telling Mastodon posters to include an image so the preview card is not added automatically.

terraboss · April 23, 2024, 2:33pm

I already have over 1500 hits per day by crawlers. Can I block them all by using Cloudflare DNS? Or what option is needed to force block them all? (Private instance)

I simply don’t want them.

Jagster · April 23, 2024, 3:16pm

Use e.g. nginx as a reverse proxy and stop unwanted user agents there. That helps a lot. Blocking countries you don’t need helps quite a lot too.

I can’t block the US, France and Germany (big VPS countries), but for me blocking Russia, Vietnam, Iran, Iraq etc. helped quite a lot.

But Discourse is quite… is resilient the right word? The situation is very different from WordPress, where those useless SEO bots, knockers, script kiddies and malicious actors can easily bring a server to its knees.

terraboss · April 24, 2024, 5:12am

I’m hosting at Hetzner Germany, with just two open ports in my firewall (80/443). And Discourse runs behind the NGINX proxy manager (sure, there are better solutions, but I’m a lazy person to code and like web frontends).

Now I’m going the whitelist route, with a random string as the only allowed entry … from now on, no more page views

PatrickF · September 14, 2024, 11:30am

A question about exactly what to put in “slow down crawler user agents”.
For us Facebook is a major culprit, with Bing a close 3rd.
Report shows the following agents as the principal crawlers soaking up page views:

What exactly should be in “slow down crawler user agents” - these exact urls including “https” or “http”? Or everything after the double-slash? Or something else? Or do we just go by trial and error?

Thanks!

Jagster · September 14, 2024, 12:39pm

To keep things simple you should use the names of those bots. You can use any part of the user agent string, but be sure it doesn’t affect more than you want.

Slowing down bots is a very unreliable method, but some bots follow that rule. But those come from your shares etc. and don’t create that much workload. WordPress would be another story.

But this is part of my blocked bots list. You get the point from it.

PatrickF · September 14, 2024, 12:57pm

Thanks for this, @Jagster - very helpful. Feels like a game of whack-a-mole sometimes, but I get the idea of using part of the crawler name string rather than the whole thing.

A work in progress for me as site admin I guess - onwards!

Jagster · July 19, 2025, 7:51am

There can be several reasons, but googlebot has its budget, and when sitemaps are the most important way to find links, it never reaches internal links when daily/weekly/monthly budget is used.

And in a forum internal links are important for users, not for Google.

But I don’t know if googlebot sees internal links. It should, though.

