Controlling Web Crawlers For a Site

Discourse · November 9, 2022, 12:04pm

This guide explains how to manage web crawlers on your Discourse site.

Required user level: Administrator

Web crawlers can significantly impact your site’s performance by increasing pageviews and server load.

When your site sees a spike in pageviews, check how much of that traffic comes from web crawlers.

Checking for crawler activity

To see whether crawlers are affecting your site, open the Site Traffic report (/admin/reports/site_traffic) from your admin dashboard. This report breaks down pageviews by source: logged-in browser users, anonymous browser users, crawlers, and other sources.

[Image: a site where crawlers work normally]

[Image: a site where crawlers are out of control]
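If you copy the daily totals out of the Site Traffic report, a quick calculation shows how large the crawler share is. The sketch below is only an illustration: the dictionary keys and the numbers are made up, and the function name `crawler_share` is hypothetical rather than part of Discourse.

```python
def crawler_share(pageviews: dict[str, int]) -> float:
    """Return the fraction of total pageviews attributed to crawlers."""
    total = sum(pageviews.values())
    if total == 0:
        return 0.0
    return pageviews.get("crawlers", 0) / total


# Hypothetical daily totals, copied by hand from the Site Traffic report.
sample_day = {
    "logged_in": 1200,
    "anonymous": 3000,
    "crawlers": 1800,
    "other": 0,
}

print(f"Crawler share: {crawler_share(sample_day):.0%}")  # prints "Crawler share: 30%"
```

What counts as "too high" depends on your site, so compare the share against your own historical baseline rather than a fixed number.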

Identifying specific crawlers

Go to the Web Crawler User Agent report (/admin/reports/web_crawlers) to find a list of web crawler names sorted by pageview count.

When a problematic web crawler hits the site, its pageview count will be much higher than that of the other crawlers. Note that several malicious crawlers may be active at the same time.
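One rough way to spot such outliers is to compare each crawler's pageviews with the median across all crawlers. This is a minimal sketch, not a Discourse feature: the crawler counts are invented, and both the function name `flag_outliers` and the factor of 10 are arbitrary assumptions you would tune for your site.

```python
from statistics import median


def flag_outliers(pageviews_by_crawler: dict[str, int], factor: float = 10.0) -> list[str]:
    """Return crawler names whose pageviews exceed `factor` times the median,
    ordered from most to fewest pageviews."""
    if not pageviews_by_crawler:
        return []
    typical = median(pageviews_by_crawler.values())
    # Guard against a median of zero so the threshold stays meaningful.
    threshold = factor * max(typical, 1)
    flagged = [name for name, views in pageviews_by_crawler.items() if views > threshold]
    return sorted(flagged, key=pageviews_by_crawler.get, reverse=True)


# Hypothetical counts, for illustration only.
report = {
    "Googlebot": 900,
    "bingbot": 650,
    "ExampleBot": 48000,
    "OtherBot": 21000,
    "DuckDuckBot": 300,
}

print(flag_outliers(report))  # prints ['ExampleBot', 'OtherBot']
```

Because the median is used instead of the mean, two or three runaway crawlers at once do not hide each other.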

Blocking and limiting crawlers

As a rule, do not block the crawlers of the major search engines, such as Google, Bing, Baidu (Chinese), Yandex (Russian), Naver (Korean), DuckDuckGo, Yahoo, and others that matter in your country.

When a web crawler is out of control, there is a good chance that it has hit other sites too. Search for its name: someone else may already have asked about it or published a report that helps you decide whether to limit or block it.

Note that if you use third-party services to monitor your site or add functionality to it via scripts, the crawlers of those services may contribute a large number of pageviews.

For a list of known untrustworthy web crawlers, see https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/blob/master/robots.txt/robots.txt

Adjusting crawler settings

Under Admin > Settings, several settings let you rate limit or block specific crawlers:

Slow down crawlers using:
- slow down crawler user agents — by default this includes gptbot, claudebot, anthropic-ai, and brightbot
- slow down crawler rate — the number of seconds between allowed requests per crawler (default: 60)
Block crawlers with:
- blocked crawler user agents — by default this includes mauibot, semrushbot, ahrefsbot, blexbot, and seo spider
Allow only specific crawlers with:
- allowed crawler user agents — when set, only the listed crawlers will be allowed to access the site; all others will be blocked. This acts as a strict allowlist. Warning: setting this will override blocked crawler user agents and block all crawlers not on the list, including major search engines if they are not included.
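The way these three settings interact can be modeled in a few lines. The sketch below only mirrors the behavior described above and is not Discourse's actual implementation: the `CrawlerSettings` class and `crawler_action` function are hypothetical, the user agent strings are shortened illustrations, and case-insensitive substring matching is an assumption. It also assumes the request has already been identified as coming from a crawler.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlerSettings:
    # Field names echo the setting names in the text; the class itself is hypothetical.
    allowed: list[str] = field(default_factory=list)
    blocked: list[str] = field(default_factory=list)
    slow_down: list[str] = field(default_factory=list)


def _matches(user_agent: str, names: list[str]) -> bool:
    # Assumption: a name matches if it appears anywhere in the user agent, ignoring case.
    ua = user_agent.lower()
    return any(name.lower() in ua for name in names)


def crawler_action(user_agent: str, settings: CrawlerSettings) -> str:
    """Return 'allow', 'block', or 'slow_down' for a crawler's user agent."""
    if settings.allowed:
        # A non-empty allowlist overrides the blocklist: anything not listed is blocked.
        if not _matches(user_agent, settings.allowed):
            return "block"
    elif _matches(user_agent, settings.blocked):
        return "block"
    if _matches(user_agent, settings.slow_down):
        return "slow_down"
    return "allow"


# Defaults as listed in the text above.
defaults = CrawlerSettings(
    blocked=["mauibot", "semrushbot", "ahrefsbot", "blexbot", "seo spider"],
    slow_down=["gptbot", "claudebot", "anthropic-ai", "brightbot"],
)

print(crawler_action("Mozilla/5.0 (compatible; SemrushBot)", defaults))  # block
print(crawler_action("Mozilla/5.0 (compatible; GPTBot/1.0)", defaults))  # slow_down
print(crawler_action("Mozilla/5.0 (compatible; Googlebot/2.1)", defaults))  # allow

# With an allowlist set, every crawler that is not on it is blocked.
strict = CrawlerSettings(allowed=["googlebot"], blocked=defaults.blocked)
print(crawler_action("Mozilla/5.0 (compatible; bingbot/2.0)", strict))  # block
```

The last call shows why the warning matters: with only `googlebot` on the allowlist, Bing's crawler is blocked even though it never appears in the blocklist.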

Make sure you know the exact user agent name of each crawler you want to control. If you adjust any of the settings above and that crawler's pageviews do not drop, double-check that you entered the correct name.
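A quick way to catch a misspelled name is to test each configured entry against user agent strings you have actually seen, for example from the Web Crawler User Agent report. This is a sketch under the assumption that names are matched as case-insensitive substrings; the function name `unmatched_names` and the sample strings are hypothetical.

```python
def unmatched_names(configured: list[str], observed_user_agents: list[str]) -> list[str]:
    """Return configured names that match none of the observed user agent strings."""
    observed = [ua.lower() for ua in observed_user_agents]
    return [
        name
        for name in configured
        if not any(name.lower() in ua for ua in observed)
    ]


# Hypothetical input: one entry contains a stray space and will never match.
configured = ["semrush bot", "ahrefsbot"]
observed = [
    "Mozilla/5.0 (compatible; SemrushBot)",
    "Mozilla/5.0 (compatible; AhrefsBot)",
]

print(unmatched_names(configured, observed))  # prints ['semrush bot']
```

Any name returned here is worth rechecking before you conclude that the setting itself is not working.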

When in doubt, start with the “slow down” option rather than a full block, then watch the crawler's pageviews over time. If you see no appreciable improvement, proceed with a full block.
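To make the difference between slowing down and blocking concrete, here is a toy model of what a per-crawler rate of one request every 60 seconds means. It is a simplified illustration, not how Discourse enforces the limit internally; the `CrawlerThrottle` class is hypothetical and takes timestamps as plain numbers so the behavior is easy to follow.

```python
class CrawlerThrottle:
    """Allow at most one request per crawler every `min_interval_seconds`."""

    def __init__(self, min_interval_seconds: float = 60.0) -> None:
        self.min_interval = min_interval_seconds
        self._last_allowed: dict[str, float] = {}

    def allow(self, crawler: str, now: float) -> bool:
        """Return True if the request at time `now` (in seconds) may proceed."""
        last = self._last_allowed.get(crawler)
        if last is not None and now - last < self.min_interval:
            return False  # too soon after the last allowed request
        self._last_allowed[crawler] = now
        return True


throttle = CrawlerThrottle(min_interval_seconds=60)
print(throttle.allow("gptbot", now=0))      # True: first request
print(throttle.allow("gptbot", now=30))     # False: only 30 seconds later
print(throttle.allow("claudebot", now=30))  # True: each crawler is tracked separately
print(throttle.allow("gptbot", now=60))     # True: 60 seconds have passed
```

A slowed crawler can still index the site, only more gradually, which is why it is the safer first step.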

Last edited by @SaraDev 2024-09-11T19:32:59Z
