This guide explains how to manage web crawlers on your Discourse site.
Required user level: Administrator
Web crawlers can significantly impact your site’s performance by increasing pageviews and server load.
If you notice a spike in your site's pageviews, it's important to check how web crawlers fit into the mix.
Checking for crawler activity
To see if crawlers are affecting your site, navigate to the Consolidated Pageviews report (/admin/reports/consolidated_page_views) from your admin dashboard. This report breaks down pageview numbers from logged-in users, anonymous users, and crawlers.
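If you prefer to monitor this programmatically, a quick way to reason about the report's numbers is to compute what share of total pageviews comes from crawlers. The sketch below uses a simplified, assumed data shape (daily counts per traffic type), not the exact JSON the report endpoint returns:

```python
# Sketch: estimate what share of total pageviews comes from crawlers.
# The data shape below is a simplified assumption for illustration,
# not the exact structure of the Consolidated Pageviews report.

def crawler_share(series):
    """series maps a traffic type to a list of daily pageview counts."""
    totals = {name: sum(counts) for name, counts in series.items()}
    overall = sum(totals.values())
    return totals.get("crawler", 0) / overall if overall else 0.0

sample = {
    "logged_in": [1200, 1100, 1300],
    "anonymous": [800, 900, 850],
    "crawler":   [5000, 7000, 9000],  # suspicious growth
}

share = crawler_share(sample)
print(f"Crawler share of pageviews: {share:.0%}")
if share > 0.5:
    print("Crawlers dominate traffic -- check the Web Crawler User Agent report.")
```

A crawler share that keeps growing while human traffic stays flat is the pattern to watch for.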
A site where crawlers work normally:
A site where crawlers are out of control:
Identifying specific crawlers
Go to the Web Crawler User Agent report (/admin/reports/web_crawlers) to find a list of web crawler names sorted by pageview count.
When a problematic web crawler hits the site, the number of its pageviews will be much higher than the other web crawlers. Note that there may be a number of malicious web crawlers at work at the same time.
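A simple way to make "much higher than the other web crawlers" concrete is to compare each crawler's count against the median. The sketch below uses made-up counts and a hypothetical threshold (10x the median), both of which are assumptions for illustration:

```python
# Sketch: flag crawlers whose pageview counts dwarf the others.
# The counts and the 10x-median threshold are illustrative assumptions.
from statistics import median

def flag_outliers(counts, factor=10):
    """Return crawlers whose pageviews exceed `factor` times the median."""
    if not counts:
        return []
    baseline = median(counts.values())
    return sorted(
        (ua for ua, n in counts.items() if n > factor * baseline),
        key=counts.get, reverse=True,
    )

crawlers = {
    "Googlebot": 4200,
    "bingbot": 3100,
    "DuckDuckBot": 900,
    "SomeScraperBot": 120000,  # hypothetical runaway crawler
}
print(flag_outliers(crawlers))  # ['SomeScraperBot']
```

Because several malicious crawlers can be active at once, the function returns every outlier rather than only the top one.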
Blocking and limiting crawlers
As a rule, avoid blocking the crawlers of the major search engines relevant to your audience, such as Google, Bing, DuckDuckGo, Yahoo, Baidu (China), Yandex (Russia), or Naver (Korea).
When a web crawler is out of control, there is a good chance the same crawler has hit other sites, and someone else has already asked about it or written reports on it. That information can help you decide whether to limit or block that particular crawler.
Note that some crawlers may contribute a large number of pageviews if you use third-party services to monitor or add functionality to your site via scripts, etc.
For a record of untrustworthy web crawlers, you may refer to this list: https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/blob/master/robots.txt/robots.txt
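The linked file is itself a robots.txt, so its entries follow the standard robots exclusion format. A minimal entry blocking one crawler looks like this (the bot name here is a placeholder, not a real crawler from the list):

```
User-agent: BadBotExample
Disallow: /
```

Note that well-behaved crawlers honor robots.txt voluntarily; abusive ones often ignore it, which is why Discourse's own blocking settings (below) are the more reliable tool.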
Adjusting crawler settings
Under Admin > Settings there are some settings that can help rate limit or block specific crawlers:

- Slow down crawlers using:
  - slow down crawler user agents
  - slow down crawler rate
- Block crawlers with:
  - blocked crawler user agents
Ensure you know the exact user agent name of the crawlers you wish to control. If you adjust any of the settings above and do not see a reduction in that agent's pageviews, double-check that you are using the correct name.
When in doubt about how to act, always start with the "slow down" option rather than a full block, and check over time whether pageviews improve. If you do not notice appreciable results, you can proceed with a full block.
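Conceptually, the three settings combine into a simple per-request decision. The sketch below is an illustrative model of that decision, not Discourse's actual implementation; the setting values and substring matching are assumptions:

```python
# Simplified model of how "slow down" vs "block" settings could apply to an
# incoming request, keyed on the crawler's user agent. Illustrative only --
# not Discourse's actual implementation.

SLOW_DOWN_AGENTS = {"examplebot"}   # cf. setting: slow down crawler user agents
SLOW_DOWN_RATE = 60                 # cf. setting: slow down crawler rate (seconds)
BLOCKED_AGENTS = {"badscraper"}     # cf. setting: blocked crawler user agents

def crawler_policy(user_agent):
    """Return ('block', None), ('slow', seconds), or ('allow', None)."""
    ua = user_agent.lower()
    if any(name in ua for name in BLOCKED_AGENTS):
        return ("block", None)
    if any(name in ua for name in SLOW_DOWN_AGENTS):
        return ("slow", SLOW_DOWN_RATE)
    return ("allow", None)

print(crawler_policy("Mozilla/5.0 (compatible; ExampleBot/2.1)"))  # ('slow', 60)
print(crawler_policy("BadScraper/0.3"))                            # ('block', None)
print(crawler_policy("Mozilla/5.0 (regular browser)"))             # ('allow', None)
```

Blocking is checked first, which matches the recommended escalation path: a crawler you eventually block no longer needs a slow-down rule.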
Last edited by @SaraDev 2024-09-11T19:32:59Z