Bingbot is default throttled


(Sam Saffron) #1

Recently @neil added built-in support for crawler traffic analysis and blacklisting / whitelisting of crawler user-agents.

One thing that immediately popped up is that bing, consistently, across multiple sites is generating significantly more load than any other crawler.

For example on meta we have the following over about a week:

User Agent Pageviews
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) 183236
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 16117
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) 15959
Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/) 9450
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html) 5022
Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com) 4498
Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07) 3976

Bing is crawling meta at more than 10x the rate of any other crawler. Looking at our richer logs the trend is very clear (and also cross checked):

Looking at a geomap we can see the traffic is very likely coming from Microsoft

image

Looking at specific ips I can see this is indeed coming from Microsoft using reverse IP lookups.

bing has no qualms hitting meta more than 5000 times in a 3 hour period, Google will not spike at over 800 and usually runs much slower.

Following this commit, bing is default throttled to 60 seconds per request:

You can remove this throttle bing by editing your slow_down_crawler_user_agents, but we don’t recommend it unless you understand the crawler traffic consequences.

We decided to take this measure to protect Discourse sites out there from being attacked by Microsoft crawlers. I have no idea why bing behaves so badly, my theory is that part of the reason it is crawling so aggressively is cause it is constantly trying to re-validate canonical links. In the logs I can see that 3 times a week it will try to figure out what the canonical page is for a post link. So, for example:

Even though we tell bing the canonical for https://meta.discourse.org/t/topic-stopwatch-theme-component/83939/20 is https://meta.discourse.org/t/topic-stopwatch-theme-component/83939 it does not appear to “trust” us and has to check back 3 times a week.

We have been in contact with Microsoft on this and they are working on it on their end, but resolution is months if not years away, so this is necessary for everyone’s protection in the meantime.


Handling Bingbot
Discourse 2.0.0.beta6 Release Notes
(Ryan Mulligan) #2

This commit FEATURE: bingbot heavily throttled till it plays nice · discourse/discourse@6179c0c · GitHub changes bingbot from blacklisted to throttled.