Bingbot is (no longer) default throttled

sam · April 5, 2018, 6:16am

Recently @neil added built-in support for crawler traffic analysis and blocklisting / allowlisting of crawler user-agents.

One thing that immediately popped up is that bing, consistently, across multiple sites is generating significantly more load than any other crawler.

For example on meta we have the following over about a week:

User Agent	Pageviews
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)	183236
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)	16117
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)	15959
Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)	9450
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)	5022
Mozilla/5.0 (compatible; DotBot/1.1; `http://www.opensiteexplorer.org/dotbot`, `help@moz.com`)	4498
Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)	3976

Bing is crawling meta at more than 10x the rate of any other crawler. Looking at our richer logs the trend is very clear (and also cross checked):

Looking at a geomap we can see the traffic is very likely coming from Microsoft

Looking at specific ips I can see this is indeed coming from Microsoft using reverse IP lookups.

bing has no qualms hitting meta more than 5000 times in a 3 hour period, Google will not spike at over 800 and usually runs much slower.

Following this commit, bing is default throttled to 60 seconds per request:

https://github.com/discourse/discourse/commit/6179c0ce51bc1d9d814a1baae354d68eb491e9fd

You can remove this throttle bing by editing your slow_down_crawler_user_agents, but we don’t recommend it unless you understand the crawler traffic consequences.

We decided to take this measure to protect Discourse sites out there from being attacked by Microsoft crawlers. I have no idea why bing behaves so badly, my theory is that part of the reason it is crawling so aggressively is cause it is constantly trying to re-validate canonical links. In the logs I can see that 3 times a week it will try to figure out what the canonical page is for a post link. So, for example:

Even though we tell bing the canonical for https://meta.discourse.org/t/topic-stopwatch-theme-component/83939/20 is https://meta.discourse.org/t/topic-stopwatch-theme-component/83939 it does not appear to “trust” us and has to check back 3 times a week.

We have been in contact with Microsoft on this and they are working on it on their end, but resolution is months if not years away, so this is necessary for everyone’s protection in the meantime.

Justin_Vega · January 5, 2020, 1:50am

Are you able to give us an update on the situation (almost two years later)?

sam · January 5, 2020, 1:56am

The update is that we raised a big commotion, they promised stuff will change and never contacted us since.

Not sure if there is anything more to do here cause it does respect the slow down command so I guess, problem solved.

I still have very low confidence with the crawling approach bing takes.

Justin_Vega · January 5, 2020, 2:00am

Microsoft could be introducing a “new” Bing (or maybe I’m just reading this blog post incorrectly). Could that mean anything?

awesomerobot · January 6, 2020, 4:21pm

Nothing in that post indicates any specific Bing changes that relate to how they crawl public sites, so suspect not.

Justin_Vega · September 1, 2020, 4:22pm

The Bing team recently created a new Twitter account for the engineering team, and so I decided to take advantage of that and tell them about the problems Discourse was having with Bing. I was able to get the attention of two people on the Bing team, but I need further assistance from people more knowledgable with Discourse. I’m not exactly an expert. https://twitter.com/facan/status/1300707035822960641?s=20 https://twitter.com/CoperniX/status/1300511151743066112?s=20 https://twitter.com/CoperniX/status/1300508479447130112?s=20

facan · September 2, 2020, 5:32am

Thanks for letting us know Justin. I am the Bing Program Manager managing the Bing crawling and indexing team. Sad to see our crawler apparently crawling too much on your Content Management System. As you said. we had issues and we adjusted… your feedback is telling us that we may have to adjust more. When we started looking at what you crawling on your web sites leveraging your CMS, as you know more than us, it will be nice if you can come back directly to me sharing example of logs what we deep dive with you.

The reality is that you know more than us what is changing your web sites… so these days, we are really encouraging web sites and Content Management System to adopt our URLs submission API Bing Webmaster Tools allowing real time indexing for added, updated , deleted content … allowing ultimately us to crawl only what’s has been modified. We have open source our code for Wordpress [Bing URL Submissions Plugin – WordPress plugin | WordPress.org] encouraging you to have a look and integrate… we can help.

Terrapop · September 2, 2020, 8:10am

It’s not really only related to Discourse, Bing is by default very aggressive crawling websites @facan

Justin_Vega · September 2, 2020, 12:19pm

The Discourse team knows a ton more about this than I do. I just wanted to raise your attention about the issue, mostly in hopes that the Discourse team would take it from here.

riking · September 10, 2020, 12:54am

There’s a clear test case here:

Can you try taking this to the developers and see if they can find the source of the bad behavior?

facan · September 10, 2020, 2:30am

Good day Sam,

I am the Program Manager managing the Bing crawling team. Can you please share the IP address(es) for Topic Stopwatch and Topic Stopwatch… according to our logs, we did not fetch these 2 URLs once in the past 2 weeks.

Thanks,
Fabrice

sam · September 10, 2020, 2:41am

That is very very old information bingbot has a crawl delay here and on all default Discourse instances.

Out of courtesy I will remove it from meta for 14 days and try to determine if it is indeed playing nicer now.

TheBestPessimist · October 12, 2020, 4:28am

Out of curiosity: do we have a status update here?

sam · October 12, 2020, 4:31am

Still under investigation, results should be available in a couple of weeks

sam · October 26, 2020, 11:44pm

Fabrice, we tested this on a few sites and crawling behavior appears a lot more reasonable.

After this is merged the default throttle will be removed.

https://github.com/discourse/discourse/pull/11035

anon23393886 · November 20, 2020, 5:50pm

I already had the throttling disabled on my own site. Glad to see that this will apply to all Discourse forums now with no manual intervention!

Frédéric · November 20, 2020, 8:46pm

Just joined to say that my colleague and I are glad to hear that bingbot is indexing this site correctly. Probably best to lock this thread now that we have sorted your issue. Wouldn’t make sense to allow more replies for something that has already been solved.

Topic		Replies	Views
Handling Bingbot Feature	29	7356	November 20, 2020
Bingbot is at it again? General	1	639	December 19, 2023
MegaIndex bot did about 4,000 pageviews on one day Community	40	4441	December 2, 2023
Sudden drop in traffic Community	40	4052	December 15, 2022
Massive traffic drop from Google searches after migrating from myBB Support	32	5439	March 17, 2022

Bingbot is (no longer) default throttled

Related topics