I just got this on Twitter, and it kind of hit a nerve.
Furthermore, I was not particularly nice to Bing earlier in the topic with my trollish remark, and I have been extremely pleasantly surprised by the amount of listening Bing staff are willing to do here.
Corporate consolidation is a HUGE problem
Like it or not, if you live in the US you are probably searching with either Google or Bing.
If you are using:

- Yahoo, you are actually using Bing or Google behind it
- DuckDuckGo, you are using Yahoo, which in turn uses Google or Bing
In the USA, Bing is pretty much the only alternative to Google; shutting Bing out is basically handing the keys to Google and letting there be no competition.
Google crawl extremely well; there is a reason they do not trust stuff like `Crawl-delay`: they know better, and they decided you don't need this.
Bing, on the other hand, have algorithms that struggle extremely hard to detect how many new URLs a site has in a week (especially if there are a lot of canonicals), so they hammer sites with HTTP requests while struggling to find new content.
On a site with 2,000 brand new URLs in one week, Bing will hit it with 180 thousand web requests, where Google/Yandex/Baidu get away with a few thousand; that is roughly 90 HTTP requests per genuinely new URL. Note though that meta is not really a giant target for Yandex/Baidu, so there is that.
There are a few schools of thought on how to react to this big mess:

- Kick up a giant political fuss, hoping that Microsoft will listen and correct the bad crawling algorithm. We have some weight here because we power a lot of sites and people do not usually touch defaults.
- Make Discourse somehow friendlier to crawlers, so crawlers can do less work to figure out the content.
- Add `Crawl-delay` to stop the bleeding (which leaves some big open questions).
- Use an esoteric solution like rate limiting by user agent; pick your poison of choice (see the sketch after this list).
- Do nothing.
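To make the rate limiting option concrete, here is a minimal nginx sketch of throttling by user agent. It is an illustration only, not our production config; the zone name, rate, port, and server name are all made up.

```nginx
http {
    # Put all bingbot traffic into one shared bucket; any other
    # user agent maps to "", which limit_req ignores entirely.
    map $http_user_agent $bing_bucket {
        default   "";
        ~*bingbot "bing";
    }

    # Cap the whole bucket at 1 request per second.
    limit_req_zone $bing_bucket zone=bing_zone:1m rate=1r/s;

    server {
        listen 80;
        server_name discourse.example.com;

        location / {
            # Allow short bursts of up to 5 queued requests;
            # anything beyond that gets rejected.
            limit_req zone=bing_zone burst=5;
            proxy_pass http://127.0.0.1:3000;
        }
    }
}
```

The appeal of this approach is that it needs no cooperation from the crawler at all; the downside is that it silently drops requests rather than telling the crawler to slow down.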
We are doing a combination here.
So yes, Peter from Twitter, we are kicking up a major fuss here. We do not do this often, but sometimes we have to.
We are kicking it up because it impacts you and other people who search. There is an internal bug in Bing, or some problematic implementation, that makes it hammer sites it should not be hammering. This is not isolated to Discourse.
I think it is quite toxic to have the approach of “it’s just a request a second, chillax and fix your framework”. Something is ill here, very ill, and it should be corrected in Bing.
I am still on the fence on whether adding `Crawl-delay` while working through this with Microsoft is better than outright banning; we are discussing this internally. The upside of crawl delay is that Microsoft respect it, and it will cut down on traffic. The downside is that they somehow think they need 180k requests, and cutting that down to a reasonable 18k requests has no guarantee it will pick the right URLs to crawl. Odds are it will not, but who knows.
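For reference, `Crawl-delay` is a directive in robots.txt; a minimal sketch, with an illustrative value rather than anything we have settled on:

```text
User-agent: bingbot
Crawl-delay: 30
```

Assuming Bing treats the value as a minimum number of seconds between requests, a delay of 30 caps a crawler at 604,800 / 30, roughly 20k requests a week, which is in the ballpark of the 18k figure above. But it says nothing about which URLs get crawled within that budget.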
In the meantime, we are working closely with Microsoft to see if they can correct this. We are also experimenting to see what kind of impact adding a sitemap has on this problematic algorithm.
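For reference, the sitemap experiment amounts to publishing a standard sitemaps.org XML file so a crawler can see which URLs changed and when, instead of re-requesting everything to find out. A minimal sketch, with a hypothetical topic URL:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One entry per topic; lastmod lets a crawler skip unchanged URLs -->
  <url>
    <loc>https://example.com/t/some-topic/1234</loc>
    <lastmod>2018-05-04</lastmod>
  </url>
</urlset>
```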