Handling Bingbot

Hi Sam,

Options 1, 2, and 3 all seem good to me, though I feel you could have done #1 without blocking Bing. Blocking just seems like the “lazy” option, as 2 & 3 would take more time to implement.

@codinghorror acknowledged the monopoly that exists in the search space, and you both acknowledge that Discourse actually has some weight to throw around. I would expect more then, from an organisation with weight to throw around, than to effectively muscle out a smaller player in the market due to behavior that you have deemed “bad”.

I agree, Bing may use a crawl profile which may not be ideal. It may even be categorically bad. But you have still chosen to muscle out the smaller player, who offers and obeys fairly well documented extensions to the robots.txt standards.

Like you said, many users don’t touch defaults, so how many users are about to deploy Discourse, only to find that they never get indexed on Bing because of those defaults? How many users are going to be harmed should Bing establish greater market penetration in a local market? What happens when “Cortana” suddenly becomes the killer voice assistant? Microsoft already sits on 89% of desktops worldwide; if they play their cards right, it could happen. Essentially, you’ve burdened your users with risk simply because you have the weight to do so, to stomp your feet and get your way.

So please, do #1, #2, and #3. Kick up a fuss on Twitter. Message the Bing team. Contact your Microsoft representatives. It’s not like the names “Sam Saffron” and “Jeff Atwood” are so unknown that you’re not going to get a response from direct contact, either via email, or via Twitter. But don’t just blanket block a smaller player in a market, because they’re not doing what you would like.

Indeed, you could even have blocked Bing due to its low market share. It’s not that unusual for companies to exclude support for various low-impact platforms (and at the end of the day, you still must have the best interests of your own company at heart). It is rather unusual to boast about it though.

3 Likes

but …

So which one is it - defaults are trivial - or people don’t touch them :wink:

2 Likes

Just putting this out here. While discussing this with Microsoft (the engineers who build Bing and the PM for Bing):

  • Microsoft never told me that this blocking will cause irreparable damage to sites’ ranking

  • Microsoft suggested a site map as a workaround

  • Microsoft did not make an explicit recommendation to use Crawl delay vs blocking

  • Microsoft said they want to fix the underlying issue

We are testing the site map theory.

My gut feel on this is that a crawl delay and outright telling the bot not to crawl will have almost the same effect if this only goes on for a month or so. Long term, crawl delay is not a proper solution, because the backlog they have is too huge and they cannot work through all the URLs they want to.
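To make the two robots.txt options concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The two robots.txt bodies and the topic path are made up for illustration and are not Discourse's actual defaults; the point is only how a compliant crawler would read a `Disallow` rule versus a `Crawl-delay` extension.

```python
from urllib.robotparser import RobotFileParser


def parse(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt that blocks Bingbot outright.
BLOCKED = """\
User-agent: bingbot
Disallow: /
"""

# Hypothetical robots.txt that only asks Bingbot to slow down.
DELAYED = """\
User-agent: bingbot
Crawl-delay: 60
"""

blocked = parse(BLOCKED)
delayed = parse(DELAYED)

# "/t/some-topic/1" is an illustrative path, not a real topic.
print(blocked.can_fetch("bingbot", "/t/some-topic/1"))  # False
print(delayed.can_fetch("bingbot", "/t/some-topic/1"))  # True
print(delayed.crawl_delay("bingbot"))                   # 60
```

Note that neither file says anything about indexing; both only describe what a crawler may fetch and how often.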

We are only telling Bing not to crawl. We say nothing on the pages, via meta tags, asking Bing or any other crawler not to index.

So, basically what we have here is a bunch of people who have a gut feel that due to SEO reasons blocking a crawler for N days via robots.txt while we work this out with Bing will damage sites forever in Bing. My gut feel is that the old content will remain in Bing for at least a few months if not more.

Now, if Microsoft tell me that what we are doing will cause irreparable damage going forward, I think that would carry more weight.

If anything, brand new sites deploying Discourse at the moment have a better chance of finding out that we messed with Bing, and of finding this topic. Crawl delay would mask that.

18 Likes

It isn’t a completely random gut feeling – about 10 years ago part of my job was fixing large sites so that they could be crawled more easily (which can make a big difference in indexing and ranking – sometimes sites even get accidentally banned in search results due to crawling issues).

I’ve seen it happen with Google, and it was just an accidental Disallow: / for a few days.

My anecdotal evidence here is that this does not happen anymore. We had Google blocked from crawling meta for 20 days a while back due to a bug in our robots file, and the blocking made zero difference long term.

Also, we are talking about Bing here not Google, so evidence about what Google did 10 years ago is not particularly strong when we are talking about Bing.

Besides

  • The Microsoft PM for Bing is probably reading this topic
  • Multiple Engineers on the Bing team are reading this topic

None of them said

“Sam stop, if you block us for a few weeks you are going to destroy everything in our index for your Discourse sites.”

What they seem to be saying is that stuff needs fixing and they want to get to a state where we are comfortable unblocking Bing again.

We are running multiple experiments at the moment. Microsoft are working on the problem. In the meantime, while this is happening, “bleeding by default” is reduced to zero.

19 Likes

From my experience, Bingbot always needs a little bit more help when it comes to crawling, especially with caching crawled results.
Bingbot, for example, ignores ETags; at least, I’ve never seen Bingbot asking for them.

Discourse’s HTTP headers explicitly say NOT to cache the results (“cache-control: no-store, must-revalidate, no-cache, private”).

Google tries to apply heuristics to ignore those headers if they don’t feel they are helpful.
(Doesn’t mean that it works all the time. Have seen Google crawling a random 404 of a product page on a shop more than once every minute for months)
Bingbot sticks more closely to what webmasters tell it. So if webmasters tell it not to keep the crawled page, it will crawl again.
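As a rough illustration of what “sticking to the headers” would mean, here is a small sketch of how a client that takes `Cache-Control` literally might decide whether it may keep a response. The function names are made up, and this follows the general HTTP caching rules (`no-store` forbids storing; `private` forbids storing in a shared cache). It is an assumption about the behaviour described above, not a description of Bingbot's actual logic.

```python
from typing import Dict, Optional


def parse_cache_control(header: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control header value into lower-cased directives."""
    directives: Dict[str, Optional[str]] = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"') or None
    return directives


def may_store(header: str, shared_cache: bool = True) -> bool:
    """Return True if a literal-minded cache may keep the response."""
    directives = parse_cache_control(header)
    if "no-store" in directives:
        return False  # nothing may be stored at all
    if shared_cache and "private" in directives:
        return False  # only the end user's own cache may store it
    return True


# Header value quoted in the post above.
DISCOURSE_HEADER = "no-store, must-revalidate, no-cache, private"

print(may_store(DISCOURSE_HEADER))       # False
print(may_store("max-age=600, public"))  # True
```

A client that cannot store a page has nothing to revalidate against, so every visit becomes a full fetch, which would fit the repeated crawling described here.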

3 Likes

Update: Bing will now be heavily throttled - one request per 60 seconds - while this is being worked on.

https://github.com/discourse/discourse/commit/6179c0ce51bc1d9d814a1baae354d68eb491e9fd

There has also been an update to the original #feature:announcements topic.
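For anyone wondering what “one request per 60 seconds” amounts to, here is a minimal sketch of a per-crawler minimum-interval throttle. This is not the code from the linked commit; the class name, the `allow` method and the injectable clock are all hypothetical and exist only to illustrate the idea.

```python
import time
from typing import Callable, Dict


class MinIntervalThrottle:
    """Allow at most one request per `min_interval` seconds for each key."""

    def __init__(self, min_interval: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval = min_interval
        self.clock = clock  # injectable so the demo does not need to sleep
        self._last_allowed: Dict[str, float] = {}

    def allow(self, key: str) -> bool:
        """Return True and record the time if `key` may make a request now."""
        now = self.clock()
        last = self._last_allowed.get(key)
        if last is not None and now - last < self.min_interval:
            return False  # too soon; a server would typically answer 429
        self._last_allowed[key] = now
        return True


# Demo with a fake clock instead of real waiting.
fake_now = [0.0]
throttle = MinIntervalThrottle(60.0, clock=lambda: fake_now[0])

print(throttle.allow("bingbot"))  # True  (first request)
fake_now[0] = 30.0
print(throttle.allow("bingbot"))  # False (only 30 seconds later)
fake_now[0] = 61.0
print(throttle.allow("bingbot"))  # True  (more than 60 seconds since the last allowed request)
```

Rejected requests do not reset the timer in this sketch, so a crawler that keeps hammering still gets one request through per interval.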

9 Likes

Well, Bing is my default search engine… Surprise! When I’m in China.

That’s because it is just about the only global search engine that works in China. Google, Yahoo, etc. are all blocked.

If you want Chinese sites then Baidu is fine. But if you want non-Chinese, non-China sites with info that is not five years and two generations out of date, you really need Bing…

5 Likes

Worry no more! Two years later, things have been resolved.

I’m glad that this case has finally come to a close. As a result, I think that it would be most appropriate to close this topic or merge it with the one I linked above.

2 Likes