When a site notices a spike in its pageviews, it’s important to check how web crawlers fit into the mix.
How to see if your site is hit by crawlers
Head to the Consolidated Pageviews report (/admin/reports/consolidated_page_views) on your admin dashboard. Here you will see a breakdown of pageview numbers from logged-in users, anonymous users, and crawlers.
A site where crawlers work normally:
A site where crawlers are out of control:
Which crawlers are hitting your site
Navigate to the Web Crawler User Agent report (/admin/reports/web_crawlers) to find a list of web crawler names ordered by pageview count, highest first.
When a problematic web crawler hits the site, its pageview count will be much higher than that of the other web crawlers. Note that several malicious web crawlers may be at work at the same time.
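The report above gives you these counts directly, but you can also cross-check against your web server's raw access logs. The sketch below is a minimal example, assuming logs in the common "combined" format; the log lines and bot names here are made up for illustration, and in practice you would read lines from your actual log file.

```python
from collections import Counter

# Hypothetical sample of access-log lines in combined log format;
# in practice, read these from your web server's access log.
SAMPLE_LOG = [
    '1.2.3.4 - - [01/Jan/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '5.6.7.8 - - [01/Jan/2024:00:00:02 +0000] "GET /t/1 HTTP/1.1" 200 812 "-" "BadBot/1.0"',
    '5.6.7.8 - - [01/Jan/2024:00:00:03 +0000] "GET /t/2 HTTP/1.1" 200 744 "-" "BadBot/1.0"',
    '9.9.9.9 - - [01/Jan/2024:00:00:04 +0000] "GET /t/3 HTTP/1.1" 200 633 "-" "Mozilla/5.0 (compatible; bingbot/2.0)"',
    '5.6.7.8 - - [01/Jan/2024:00:00:05 +0000] "GET /t/4 HTTP/1.1" 200 911 "-" "BadBot/1.0"',
]

def user_agent(line: str) -> str:
    # The user agent is the last double-quoted field in combined log format.
    return line.rsplit('"', 2)[-2]

# Tally requests per user agent, most active first.
counts = Counter(user_agent(line) for line in SAMPLE_LOG)
for agent, hits in counts.most_common():
    print(f"{hits:5d}  {agent}")
```

A crawler whose count dwarfs the others, as "BadBot/1.0" does in this sample, is the kind of agent worth investigating.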
How do you know which crawlers to block and which to limit
It is a good habit not to block the crawlers of the main search engines, such as Google, Bing, DuckDuckGo, and Yahoo, as well as the ones relevant to your country, such as Baidu (China), Yandex (Russia), or Naver (Korea).
When a web crawler is out of control, search the internet to find out whether it is malicious. There is a good chance the same crawler has hit other sites, and someone else has already asked about it or written reports that will help you decide whether to limit or block that particular crawler.
Note that some crawlers may legitimately generate a large number of pageviews if you use third-party services to monitor your site or add functionality to it via external scripts.
For a record of untrustworthy web crawlers, you can refer to this list: https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/blob/master/robots.txt/robots.txt
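That list is itself in robots.txt format; entries in that style look like the sketch below. The bot names here are only illustrative, so replace them with crawlers you have actually identified on your site. Keep in mind that robots.txt is advisory: well-behaved crawlers honor it, while malicious ones often ignore it, which is why the site settings matter too.

```text
# Illustrative robots.txt entries; substitute real offending crawler names.
User-agent: MJ12bot
Disallow: /

User-agent: AhrefsBot
Disallow: /
```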
Which settings do you need to change
You’ll want to make sure you have identified the proper crawler user agent name. If you adjust any of the settings below and do not see a reduction in pageviews from that agent, double-check that you are using the proper name.
Under Admin > Settings there are some settings that can help rate limit specific crawlers:
slow down crawler user agents
slow down crawler rate
and block crawlers:
blocked crawler user agents
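As a rough sketch, the settings above might be filled in as follows. The crawler names are placeholders, and the exact input format (the admin UI typically lets you add multiple user agent names as separate list entries) and the meaning of the rate value may vary with your version, so check the setting's own description.

```text
slow down crawler user agents:  SemrushBot, MJ12bot
slow down crawler rate:         60
blocked crawler user agents:    BadBot
```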
When in doubt about how to act, always start with the “slow down” option rather than a full block. Check over time whether the situation improves, and proceed with a full block if you do not notice appreciable results.