Controlling Web Crawlers For a Site

When a site sees a spike in pageviews, it’s important to check how much of that traffic comes from web crawlers.


How to see if your site is hit by crawlers

Head to the Consolidated Pageviews report (/admin/reports/consolidated_page_views) on your admin dashboard. Here you will see a breakdown of pageview numbers from logged-in users, anonymous users, and crawlers.

A site where crawlers work normally:

A site where crawlers are out of control:

Which crawlers are hitting your site

Navigate to the Web Crawler User Agent report (/admin/reports/web_crawlers) to find a list of web crawler names ordered by highest pageview count.

When a problematic web crawler hits the site, the number of its pageviews will be much higher than the other web crawlers. Note that there may be a number of malicious web crawlers at work at the same time.
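As a rough way to apply this to numbers copied from the report, the Python sketch below flags crawlers whose pageview count is far above the median of all crawlers. The function name, the crawler names and counts, and the 10x threshold are illustrative assumptions, not values taken from the report itself.

```python
from statistics import median

def flag_outlier_crawlers(pageviews, factor=10):
    """Return crawler names whose pageviews exceed `factor` times the median."""
    if not pageviews:
        return []
    typical = median(pageviews.values())
    flagged = [name for name, count in pageviews.items() if count > factor * typical]
    # Highest pageview count first, mirroring the ordering of the report.
    return sorted(flagged, key=lambda name: pageviews[name], reverse=True)

# Hypothetical counts, typed in by hand from the report.
sample = {
    "Googlebot": 1200,
    "bingbot": 900,
    "ExampleBot": 48000,
    "YandexBot": 300,
    "OtherBot": 35000,
}
print(flag_outlier_crawlers(sample))
```

More than one name can be flagged, which matches the case where several problematic crawlers are active at once. The threshold is a starting point to tune, not a rule.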

How do you know which crawlers to block and which to limit

It is good practice not to block the crawlers of the major search engines, such as Google, Bing, Baidu (Chinese), Yandex (Russian), Naver (Korean), DuckDuckGo, Yahoo, and others relevant to your country.

When a web crawler is out of control, search the internet for its name to find out whether it is malicious. There is a good chance that the same crawler has hit other sites and that someone has already asked about it or published a report, which will help you decide whether to limit or block that particular crawler.

Note that some crawlers may contribute a large number of pageviews if you use third-party services to monitor or add functionality to your site via scripts, etc.

For a list of untrustworthy web crawlers, you may refer to https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/blob/master/robots.txt/robots.txt
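If you save a copy of such a list locally, a short script can cross-check it against the names in your report. The Python sketch below assumes the file uses standard robots.txt groups (User-agent lines followed by rules) and treats a name as listed only when its group contains "Disallow: /". The function name and the sample text are hypothetical, and the exact-name comparison is a simplification.

```python
def blocked_agents_from_robots(robots_text):
    """Collect user agent names that a robots.txt text disallows entirely."""
    blocked = set()
    current = []       # user agents of the group currently being read
    in_rules = False   # True once the current group has seen a rule line
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                current, in_rules = [], False
            current.append(value)
        else:
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.update(agent.lower() for agent in current)
    return blocked

# Hypothetical excerpt in robots.txt format.
sample = """
User-agent: ExampleBadBot
Disallow: /

User-agent: Googlebot
Disallow:
"""
listed = blocked_agents_from_robots(sample)
report_names = ["Googlebot", "ExampleBadBot"]
print([name for name in report_names if name.lower() in listed])
```

A match is only a hint. It is still worth searching for the crawler by name before limiting or blocking it.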

Which settings do you need to change

You’ll want to make sure you have identified the proper crawler user agent name. If you adjust any of the settings below and do not see a reduction in pageviews from that agent, you may want to double-check that you are using the proper name.

Under Admin > Settings there are some settings that can help rate limit specific crawlers:

  • slow down crawler user agents
  • slow down crawler rate

and block crawlers:

  • blocked crawler user agents

When in doubt about how to act, always start with the “slow down” option rather than a full block. Check over time if there are improvements. You can proceed with a full block if you do not notice appreciable results.
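Before entering a name in any of these settings, it can help to test it against a full user agent string taken from your logs. The Python sketch below assumes a case-insensitive substring match against a pipe-separated list. Both are assumptions made for illustration, not a description of how the settings are implemented, and the helper name and sample strings are hypothetical.

```python
def matching_entries(user_agent, setting_value):
    """Return the entries of a pipe-separated list that occur in the user agent."""
    entries = [entry.strip() for entry in setting_value.split("|") if entry.strip()]
    ua = user_agent.lower()
    # Assumption: an entry matches when it appears anywhere in the string, ignoring case.
    return [entry for entry in entries if entry.lower() in ua]

# Hypothetical user agent string as it might appear in a request log.
ua = "Mozilla/5.0 (compatible; ExampleBot/2.1; +https://example.com/bot)"
print(matching_entries(ua, "examplebot|otherbot"))
print(matching_entries(ua, "example-bot"))
```

Under these assumptions the second call finds nothing because of the hyphen. A small mismatch like that is one reason a setting change may appear to have no effect.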

Should there be some kind of disclaimer that this only works with well-behaved crawlers? Even Google will bypass all of these when it arrives via links from Gmail.

Both are enforced on the server.

However, if a bad bot pretends to be Chrome or another client by spoofing headers, then we cannot use headers to detect it…

Killer fact: preview cards count as a page view!

The server I admin appears to have been swamped with preview card requests of the form http.rb/5.1.0 (Mastodon/4.0.2; +https://mstdn.science/)

I don’t think any action can be taken apart from telling Mastodon posters to include an image so the preview card is not added automatically.
