Handling Bingbot

Bing is an insignificant source of traffic. If you can prove otherwise with actual data from a site you personally operate, please do so. If you cannot, well…

3 Likes

My data (se23.life) agrees:

About 1% of visits.

5 Likes

Which would be fine if it didn’t hit sites 20x harder than Google… lots of replies on Twitter corroborated this bad behavior, so it is not just Discourse either.

(Four example tweets were linked here.)

4 Likes

I can see how it’s difficult to believe, but a lot of small business owners do put a lot of value on it. :slight_smile:

Especially in the early days, forums need to capture a few regular posters to build the community. Those few critical users might come from any source, including Bing or Yahoo. If you only have 10 visitors per day, then losing 10% of your traffic might include one of those early enthusiasts. (The numbers are just arbitrary examples.) Sometimes just a few users are very important for a site’s growth and culture.

Sure, my web stats show that some of the regulars on one site keep coming back via searches from one of those Bing-powered sources. For example, users who arrive via DuckDuckGo spend twice as much time (over 8 minutes) on my site as the average user.

Bing powers Yahoo, which is much more popular in places like Taiwan and Japan. DuckDuckGo also uses Bing, among other sources. An example where it could matter is a travel site: some of its users might be in countries where Yahoo is common, or they might be on Windows 10 and never have changed the default settings. They might no longer find Discourse sites when they search the web.

Bing might not seem like a major source of traffic, but I help a lot of people with computer use (via organizing a programming club in Berkeley) and it’s surprising how many of the Windows 10 users end up on Bing one way or another.

Also, a lot of small business owners just love to show up in search engine results regardless of the effects on their businesses. :slight_smile:

Just some friendly, constructive feedback…

1 Like

I don’t think that only percentages should be considered. One of those 7,447 Bing users might be the one who starts the topic that gets a million page views.

1 Like

Since I just got this on Twitter, it kind of hit a nerve.

Furthermore, I was not particularly nice to Bing earlier in the topic with my trollish remark, and I have been extremely pleasantly surprised by the amount of listening Bing staff are willing to do here.

Corporate consolidation is a HUGE problem

Like it or not, if you live in the US, you are probably searching with either Google or Bing.

If you are using:

Yahoo :arrow_right: you are actually using Bing or Google behind it
DuckDuckGo :arrow_right: you are using Yahoo, which in turn uses Google or Bing

In the USA, Bing is pretty much the only alternative to Google; shutting Bing out is basically handing the keys to Google and letting there be no competition.

Google crawls extremely well. There is a reason they do not trust stuff like Crawl-delay: they know better, and they decided you don’t need it.

Bing, on the other hand, has algorithms that struggle extremely hard to detect how many new URLs a site has in a week (especially if there are a lot of canonicals), so it hammers sites with HTTP requests trying to find new content.

On a site with 2000 brand new URLs in one week, Bing will hit it with 180 thousand web requests, where Google/Yandex/Baidu get away with a few thousand. Note, though, that meta is not really a giant target for Yandex/Baidu, so there is that.

There are a few schools of thought on how to react to this big mess:

  1. Kick up a giant political fuss hoping that Microsoft will listen and correct the bad crawling algorithm. We have some weight here cause we power a lot of sites and people do not usually touch defaults.

  2. Make Discourse somehow friendlier to crawlers so they can do less work to figure out the content.

  3. Add Crawl-delay to stop the bleeding (which leaves some big open questions)

  4. Use an esoteric solution here, like rate limiting by user agent; pick your poison of choice (a rough sketch of this is below).

  5. Do nothing.

We are doing a combination here.
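
As a rough illustration of what option 4 could look like, here is a minimal sketch of rate limiting keyed by crawler user agent, written in Python. This is not what Discourse actually ships; the agent list, the 60-requests-per-minute budget, and all names are invented for the example.

```python
import time
from collections import defaultdict

THROTTLED_AGENTS = ("bingbot",)   # user-agent substrings to throttle (illustrative)
MAX_REQUESTS = 60                 # allowed requests per window (illustrative)
WINDOW_SECONDS = 60               # window length in seconds

_hits = defaultdict(list)         # throttled agent -> timestamps of recent requests

def allow_request(user_agent, now=None):
    """Return True to serve the request, False to answer with 429 Too Many Requests."""
    now = time.time() if now is None else now
    key = next((a for a in THROTTLED_AGENTS if a in user_agent.lower()), None)
    if key is None:
        return True                          # not a throttled crawler: always allow
    recent = [t for t in _hits[key] if now - t < WINDOW_SECONDS]
    if len(recent) >= MAX_REQUESTS:
        _hits[key] = recent
        return False                         # over budget for this window
    recent.append(now)
    _hits[key] = recent
    return True

# Example: the 61st bingbot request within one minute gets rejected.
results = [allow_request("Mozilla/5.0 (compatible; bingbot/2.0)", now=0.0) for _ in range(61)]
print(results[-1])  # False
```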

So yes, Peter from Twitter, we are kicking up a major fuss here. We do not do this often, but sometimes we have to.

We are kicking it up cause it impacts you and other people who search. There is an internal bug in Bing, or some problematic implementation, that makes it hammer sites it should not be hammering. This is not isolated to Discourse.

I think it is quite toxic to have the approach of “it’s just a request a second, chillax and fix your framework”. Something is ill here, very ill, and it should be corrected in Bing.

I am still on the fence about whether adding Crawl-delay while working through this with Microsoft is better than outright banning; we are discussing this internally. The upside of Crawl-delay is that Microsoft respects it and it will cut down on traffic. The downside is that they somehow think they need 180k requests; cutting that down to a more reasonable 18k gives no guarantee they will pick the right URLs to crawl. Odds are they will not, but who knows.
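
For concreteness, here is how a compliant robots.txt parser (Python's stdlib robotparser, in this case) reads the two alternatives being weighed: a Bing-specific Crawl-delay versus an outright Disallow. The 60-second delay and the topic path are invented for the example.

```python
from urllib.robotparser import RobotFileParser

crawl_delay_variant = """\
User-agent: bingbot
Crawl-delay: 60
""".splitlines()

blocking_variant = """\
User-agent: bingbot
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(crawl_delay_variant)
print(rp.can_fetch("bingbot", "/t/some-topic/123"))  # True: may fetch, just slowly
print(rp.crawl_delay("bingbot"))                     # 60: seconds between requests

rp = RobotFileParser()
rp.parse(blocking_variant)
print(rp.can_fetch("bingbot", "/t/some-topic/123"))  # False: may not fetch at all
```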

In the meantime we are working closely with Microsoft to see if they can correct this. We are also experimenting to see what kind of impact adding a sitemap has on this problematic algorithm.
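
To make the sitemap experiment concrete: the point of a sitemap is to hand the crawler an explicit list of URLs with last-modified dates, so it does not have to hammer the site to discover what is new. A minimal sketch of the format follows, with invented URLs and dates; this is not Discourse’s actual sitemap code.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Build a two-entry sitemap: each <url> carries a location and a lastmod date.
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
urlset = ET.Element("urlset", xmlns=NS)

for path, last_mod in [
    ("/t/handling-bingbot/1234", date(2018, 5, 1)),
    ("/t/another-new-topic/1235", date(2018, 5, 2)),
]:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = "https://forum.example.com" + path
    ET.SubElement(url, "lastmod").text = last_mod.isoformat()

print(ET.tostring(urlset, encoding="unicode"))
```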

22 Likes

It has one benefit, @sam: blocking Bing outright can cause sites to be harmed or even fall out of the Bing index.

Rate limiting in the interim would allow Bing to retain a connection and presumably reduce any impact.

4 Likes

Which is good and correct, because my working theory is that nobody will even notice. That’s how bad Bing is.

I am somewhat sympathetic to the argument that there are literally zero alternatives to Google other than Bing, which does sadly turn out to be true.

4 Likes

We see about 3% of our search-originated traffic from Bing (97% is Google, ~0% for Yandex and Baidu).

Several of our clients are probably unable to switch (big corporations have some byzantine rules).

4 Likes

I’m far from a fan of Bing, and I’m hoping that’s a joke, because otherwise what you’re saying is that we’re now deciding to shut off entire demographics of audiences just because they don’t install a new browser or swap the default search.

1 Like

Mostly our customers are cancelling their accounts due to excessive pageviews (essentially, Bing) that force them into higher-priced hosting brackets.

6 Likes

Sure, so the choices appear to be:

  • do nothing (always an option, but obviously impractical here)
  • limit it into oblivion while @sam talks to them; customers see an instant reduction in traffic and no potentially long-lasting impact on search
  • shut the door now, to hell with the consequences
  • block bing for hosted customers only

Coming from an enterprise background, the tendency would always be to veer towards the first and end up going down the second avenue.

Is there some other benefit to the third option, other than beating on Microsoft and Bing because it’s fun?

Was there no scope to block it for hosted customers only, via a plugin or otherwise? Looking at other specialty hosting services for other products, many only count true pageviews and ignore crawlers; is that conceivable for Discourse?

3 Likes

It’s trivial to change this default; just edit one setting in your site settings, which takes all of 15 seconds. Then you can watch the Bing carnage :boom: unfold on your site too :wink:

I am beating on Bing and Microsoft at the moment because they are behaving exceptionally badly to the point that it is costing us business, as I said above. Real money. This is not an abstract concern.

8 Likes

Hi Sam,

Options 1, 2, 3 all seem like good options to me, though I feel you could have done #1 without blocking Bing. That just seems like the “lazy” option, as 2 & 3 would take more time to implement.

@codinghorror acknowledged the monopoly that exists in the search space, and you both acknowledge that Discourse actually has some weight to throw around. I would expect more, then, from an organisation with weight to throw around, than to effectively muscle out a smaller player in the market due to behavior that you have deemed “bad”.

I agree, Bing may use a crawl profile which is not ideal. It may even be categorically bad. But you have still chosen to muscle out the smaller player, who offers and obeys fairly well-documented extensions to the robots.txt standard.

Like you said, many users don’t touch defaults. So how many users are about to deploy Discourse, only to find that they never get indexed on Bing due to those defaults? How many users are going to be harmed should Bing establish greater market penetration in a local market? What happens when “Cortana” suddenly becomes the killer voice assistant? Microsoft already sits on 89% of desktops worldwide; if they play their cards right, it could happen. Essentially, you’ve burdened your users with risk simply because you have the weight to do so, to stomp your feet, and get your way.

So please, do #1, #2, and #3. Kick up a fuss on Twitter. Message the Bing team. Contact your Microsoft representatives. It’s not like the names “Sam Saffron” and “Jeff Atwood” are so unknown that you’re not going to get a response from direct contact, either via email or via Twitter. But don’t just blanket-block a smaller player in a market because they’re not doing what you would like.

Indeed, you could even have blocked Bing due to its low market share. It’s not that unusual for companies to exclude support for various low-impact platforms (and at the end of the day, you still must have the best interests of your own company at heart). It is rather unusual to boast about it, though.

3 Likes

but …

So which one is it: defaults are trivial to change, or people don’t touch them? :wink:

2 Likes

Just putting this out here while discussing this with Microsoft (the engineers who build Bing and the PM for Bing):

  • Microsoft never told me that this blocking will cause irreparable damage to site rankings

  • Microsoft suggested a sitemap as a workaround

  • Microsoft did not make an explicit recommendation to use Crawl-delay vs. blocking

  • Microsoft said they want to fix the underlying issue

We are testing the sitemap theory.

My gut feel on this is that Crawl-delay vs. outright telling the bot not to crawl will have almost the same effect if this only goes on for a month or so. Long term, Crawl-delay is not a proper solution cause the backlog they have is too huge and they cannot work through all the URLs they want to.

We are only telling Bing not to crawl; we say nothing on the pages themselves asking Bing or any other crawler not to index them using meta tags.
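
The distinction matters: robots.txt only controls fetching, while asking to be dropped from the index is done with a robots meta tag on the page itself. A crude way to confirm a page is not asking for de-indexing (just a sketch, not anything Discourse ships):

```python
import re
import urllib.request

def asks_for_noindex(url):
    """Crude check for a <meta name="robots" ... noindex ...> tag in a page."""
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
    return bool(re.search(r'<meta[^>]+name=["\']robots["\'][^>]*noindex', html, re.I))

# Expected to print False for a Discourse page: it may be blocked from crawling
# via robots.txt, but it never asks to be removed from the index.
print(asks_for_noindex("https://meta.discourse.org/"))
```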

So, basically, what we have here is a bunch of people who have a gut feel that, for SEO reasons, blocking a crawler for N days via robots.txt while we work this out with Bing will damage sites in Bing forever. My gut feel is that the old content will remain in Bing for at least a few months, if not more.

Now, if Microsoft tell me that what we are doing will cause irreparable damage going forward, I think it would have more weight.

If anything, brand new sites deploying Discourse at the moment have more of a chance of finding out that we messed with Bing and finding this topic. Crawl-delay would mask that.

18 Likes

It isn’t a completely random gut feeling: about 10 years ago, part of my job was fixing large sites so that they could be crawled more easily (which can make a big difference in indexing and ranking; sometimes sites even get accidentally banned from search results due to crawling issues).

I’ve seen it happen with Google, and it was just an accidental Disallow: / for a few days.

My anecdotal evidence here is that this does not happen anymore: we had Google blocked from crawling meta for 20 days a while back due to a bug in our robots.txt file, and the blocking made zero difference long term.

Also, we are talking about Bing here, not Google, so evidence about what Google did 10 years ago is not particularly strong.

Besides:

  • The Microsoft PM for Bing is probably reading this topic
  • Multiple Engineers on the Bing team are reading this topic

None of them said:

“Sam, stop; if you block us for a few weeks you are going to destroy everything in our index for your Discourse sites.”

What they seem to be saying is that stuff needs fixing, and they want to get to a state where we are comfortable unblocking Bing again.

We are running multiple experiments at the moment. Microsoft are working on the problem. In the meantime, while this is happening, “bleeding by default” is reduced to zero.

19 Likes

In my experience, Bingbot always needs a little bit more help when it comes to crawling, especially caching crawled results.
Bingbot, for example, ignores ETags; I’ve never seen Bingbot asking for them.

Discourse’s HTTP headers explicitly say NOT to cache the results (“cache-control: no-store, must-revalidate, no-cache, private”).

Google tries to apply heuristics and ignore those headers if it does not feel they are helpful.
(That doesn’t mean it works all the time; I have seen Google crawl a random 404 product page on a shop more than once a minute for months.)
Bingbot sticks more closely to what webmasters tell it, so if webmasters tell it not to cache the crawled page, it will crawl the page again.
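
For readers following along, the ETag point is about HTTP revalidation: a crawler that stores the ETag it received can later ask “has this changed?” and get a tiny 304 response instead of re-downloading the page. A rough sketch of that flow, using the third-party requests library and a placeholder URL (nothing here is specific to Bingbot or Discourse):

```python
import requests

URL = "https://example.com/"  # placeholder; any page that serves an ETag will do

first = requests.get(URL)
etag = first.headers.get("ETag")

if etag:
    # Revalidate with If-None-Match: an unchanged page comes back as a
    # body-less "304 Not Modified" instead of the full document.
    second = requests.get(URL, headers={"If-None-Match": etag})
    print(second.status_code)  # 304 if unchanged, 200 with a full body otherwise
else:
    print("No ETag offered, so the crawler has to re-download the whole page.")
```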

3 Likes

Update: Bing will now be heavily throttled (one request per 60 seconds) while this is being worked on.

https://github.com/discourse/discourse/commit/6179c0ce51bc1d9d814a1baae354d68eb491e9fd

There has also been an update to the original #feature:announcements topic.

9 Likes