Handling Bingbot

(Josh) #1

I couldn’t comment on this post, so will reply here.

I think it might cause problems for websites to completely block Bing like that. Most people won’t see the announcement, and Bing has significant usage in the US and Japan (via Yahoo).

Here’s an alternate way to slow Bing down without risking sites’ search engine rankings:

Blocking a search engine with robots.txt can result in loss of rankings that can be difficult to recover from (personal experience).

It looks like this setting would be safe:

User-agent: msnbot 
Crawl-delay: 1
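
To make the difference between the two approaches concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The two rule sets and the example URL are assumptions made up for the comparison; they are not the actual Discourse defaults.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies, for comparison only.
BLOCK_RULES = """\
User-agent: msnbot
Disallow: /
"""

DELAY_RULES = """\
User-agent: msnbot
Crawl-delay: 1
"""


def summarize(rules: str, agent: str, url: str) -> tuple:
    """Return (allowed, crawl_delay) for `agent` under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url), parser.crawl_delay(agent)


if __name__ == "__main__":
    url = "https://forum.example.com/t/some-topic/123"  # placeholder URL
    print("block:", summarize(BLOCK_RULES, "msnbot", url))
    print("delay:", summarize(DELAY_RULES, "msnbot", url))
```

Under the blocking rules the crawler may not fetch the page at all, while under the delay rules it may still fetch it but is asked to wait between requests. Whether a given crawler honors `Crawl-delay` is up to that crawler.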

(Jay Pfaffman) #2

This does seem like a less nuclear response and it makes sense that once blocked, it might be hard to convince Bing to try again (or so one would hope!).

What do you think, @sam?

(Jeff Atwood) #3

No, the nuclear option is what we want here.

Ultimately, at 1 request per second, if Bing is indexing badly the results are going to be terrible regardless. Think about how spiders work, how many links they need to visit (per day? per week?), and what a bad one would do.
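
For a sense of scale, here is a quick back-of-the-envelope sketch of what "1 request per second" allows. It assumes `Crawl-delay: N` is read as "at most one request every N seconds", which is a common interpretation but not one every crawler is guaranteed to follow.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY


def max_requests(crawl_delay_s: float, period_s: int) -> int:
    """Upper bound on requests in `period_s` at one request per `crawl_delay_s`."""
    return int(period_s // crawl_delay_s)


if __name__ == "__main__":
    print(max_requests(1, SECONDS_PER_DAY))   # 86400
    print(max_requests(1, SECONDS_PER_WEEK))  # 604800
```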

(Josh) #4

The speed of the crawler won’t necessarily affect rankings, though blocking it will. Based on personal experience, blocking a search engine bot can lead to a permanent drop. If this change is kept, I think it’s going to hurt some sites, and many webmasters won’t be aware of what happened…

(Sam Saffron) #5

I am working with Bing at the moment on debugging this; if the default becomes friendly again we will reconsider.

(Josh) #6

If anyone pulls that code in the meantime, they might get an unpleasant surprise. Imagine the webmaster proud of their rankings in Bing who suddenly sees their position drop and possibly never return. I only saw the announcement about how to override it by chance, so I’m guessing that most people won’t know what happened. It can be dangerous to mess around with robots.txt…

(Sam Saffron) #7

I am trying really really really hard… nothing is happening… will try harder

:troll: I had to … you walked right into that.

Anyway, I will see if the giant Microsoft boat can be moved here. This is for the greater good, because if they fix this issue they correct crawling for every Discourse site, regardless of the version it is running, and probably a ton of non-Discourse sites.

(Jeff Atwood) #8

Bing is an insignificant source of traffic. If you can prove otherwise with actual data from a site you personally operate please do so. If you cannot, well…

(Chris Beach) #9

My data (se23.life) agrees:

About 1% of visits.

(Jeff Atwood) #10

Which would be fine if it didn’t hit sites 20x harder than Google… lots of replies on Twitter corroborated this bad behavior so it is not just Discourse either.

(Josh) #11

I can see how it’s difficult to believe, but a lot of small business owners do put a lot of value on it. :slight_smile:

Especially in the early days, forums need to capture a few regular posters to build the community. Those few critical users might come from any source, including Bing or Yahoo. If you only have 10 visitors per day, then losing 10% of your traffic might include one of those early enthusiasts. (The numbers are just arbitrary examples.) Sometimes just a few users are very important for a site’s growth and culture.

Sure, my web stats show that some of the regulars on one site return via a search from one of those Bing-powered sources. For example, users who arrive via DuckDuckGo spend twice as much time (over 8 minutes) on my site as the average user.

Bing powers Yahoo, which is much more popular in places like Taiwan and Japan. DuckDuckGo also uses Bing, among other sources. An example where it could matter is a travel site. Some of the users might be from countries where they are using Yahoo or they have Windows 10 and didn’t change the default settings. They might no longer find Discourse sites when they search the Web.

Bing might not seem like a major source of traffic, but I help a lot of people with computer use (via organizing a programming club in Berkeley) and it’s surprising how many of the Windows 10 users end up on Bing one way or another.

Also, a lot of small business owners just love to show up in search engine results regardless of the effects on their businesses. :slight_smile:

Just some friendly, constructive feedback…

(Josh) #12

I don’t think that only percentages should be considered. One of those 7,447 Bing users might be the one who starts the topic that gets a million page views.

(Sam Saffron) #14

Since I just got this in Twitter, it kind of hit a nerve

Furthermore I was not particularly nice to Bing earlier in the topic with my trollish remark and have been extremely pleasantly surprised by the amount of listening Bing staff are willing to do here.

Corporate consolidation is a HUGE problem

Like it or not, if you live in the US you are probably searching using either Google or Bing

If you are using:

Yahoo :arrow_right: you are actually using Bing or Google behind it
DuckDuckGo :arrow_right: you are using Yahoo which in turn uses Google or Bing

In the USA Bing is pretty much the only alternative to Google, shutting Bing out is basically handing the keys to Google and letting there be no competition.

Google crawl extremely well, there is a reason they do not trust stuff like Crawl-delay, they know better, and they decided you don’t need this.

Bing on the other hand have algorithms that struggle extremely hard to detect how many new URLs a site has in a week (especially if there are a lot of canonicals), they hammer sites with HTTP requests struggling to find new content.

On a site with 2000 brand new URLs in one week, Bing will hit it with 180 thousand web requests when Google/Yandex/Baidu will get away with a few thousand. Note though that meta is not really a giant target for Yandex/Baidu so there is that.

There are a few schools of thought of how to react to this big mess:

  1. Kick up a giant political fuss hoping that Microsoft will listen and correct the bad crawling algorithm. We have some weight here cause we power a lot of sites and people do not usually touch defaults.

  2. Make Discourse somehow friendlier to crawlers so crawlers can do less work to figure out the content.

  3. Add Crawl-delay to stop the bleeding (which leaves some big open questions)

  4. Use an esoteric solution here like rate limiting by user agent, pick your poison of choice.

  5. Do nothing.
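
Option 4 in the list above, rate limiting by user agent, could be sketched roughly as follows. This is a minimal in-memory illustration, not how Discourse implements anything; the class name, the `bingbot` token, and the one-second interval are all assumptions chosen for the example.

```python
import time
from typing import Callable, Dict, Optional


class UserAgentRateLimiter:
    """Allow at most one request per configured interval for each throttled crawler."""

    def __init__(self, limits: Dict[str, float],
                 clock: Callable[[], float] = time.monotonic) -> None:
        # Maps a lowercase user-agent substring to its minimum interval in seconds.
        self._limits = {token.lower(): interval for token, interval in limits.items()}
        self._clock = clock
        self._last_allowed: Dict[str, float] = {}

    def _match(self, user_agent: str) -> Optional[str]:
        ua = user_agent.lower()
        for token in self._limits:
            if token in ua:
                return token
        return None

    def allow(self, user_agent: str) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        token = self._match(user_agent)
        if token is None:
            return True  # not a throttled crawler
        now = self._clock()
        last = self._last_allowed.get(token)
        if last is not None and now - last < self._limits[token]:
            return False
        self._last_allowed[token] = now
        return True


if __name__ == "__main__":
    limiter = UserAgentRateLimiter({"bingbot": 1.0})  # hypothetical limit
    ua = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
    print(limiter.allow(ua))  # first request passes
    print(limiter.allow(ua))  # an immediate second request is throttled
```

A rejected request would typically be answered with an HTTP 429. A production version would also need shared state across web workers, which this sketch leaves out.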

We are doing a combination here.

So yes Peter from Twitter, we are kicking up a major fuss here. We do not do this often, but sometimes we have to.

We are kicking it up cause it impacts you and other people who search. There is an internal bug in Bing or some problem implementation that makes it hammer sites it should not be hammering. This is not isolated to Discourse.

I think it is quite toxic to have the approach of “it’s just a request a second, chillax and fix your framework”. Something is ill here, very ill and it should be corrected in Bing.

I am still on the fence on whether adding Crawl-delay while working through this with Microsoft is better than outright banning; we are discussing this internally. The upside of Crawl-delay is that Microsoft respect it and it will cut down on traffic. The downside is that they somehow think they need 180k requests, and cutting that down to a reasonable 18k requests comes with no guarantee that Bing will pick the right URLs to crawl. Odds are it will not, but who knows.
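
As a rough sanity check on those numbers, the sketch below works out the average request rate implied by 180k requests in a week and the spacing needed to stay within 18k. It assumes requests are spread evenly over the week and that Crawl-delay means "one request every N seconds"; real crawling is likely burstier than that.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800


def average_rate(requests_per_week: int) -> float:
    """Average requests per second if the crawl were spread evenly over a week."""
    return requests_per_week / SECONDS_PER_WEEK


def delay_for_budget(requests_per_week: int) -> float:
    """Seconds between requests needed to stay within a weekly request budget."""
    return SECONDS_PER_WEEK / requests_per_week


if __name__ == "__main__":
    print(round(average_rate(180_000), 3))     # 0.298
    print(round(delay_for_budget(18_000), 1))  # 33.6
```

Under those assumptions, 180k requests a week already averages out below one request per second, so a budget of 18k a week would correspond to a delay of roughly half a minute rather than one second.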

In the meantime we are working closely with Microsoft to see if they can correct this. We are also experimenting to see what kind of impact adding a sitemap has on this problem algorithm.
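
For readers unfamiliar with sitemaps, here is a minimal sketch of what generating one looks like, using only the Python standard library. The function name and the example URL are hypothetical, and this is not how Discourse builds its sitemap; whether a sitemap actually changes Bing's behavior is exactly the open question.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import Iterable, Tuple

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries: Iterable[Tuple[str, date]]) -> str:
    """Render (url, last_modified) pairs as a sitemap XML string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # lastmod lets a crawler skip pages that have not changed.
        ET.SubElement(url, "lastmod").text = lastmod.isoformat()
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Placeholder topic URL and date, for illustration only.
    print(build_sitemap([("https://forum.example.com/t/some-topic/123", date(2018, 7, 1))]))
```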

(Stephen) #15

It has one benefit @sam - blocking Bing outright can cause sites to be harmed or even fall out of their index.

Rate limiting in the interim would allow Bing to retain a connection and presumably reduce any impact.

(Jeff Atwood) #16

Which is good and correct, because my working theory is that nobody will even notice. That’s how bad Bing is.

I am somewhat sympathetic to the argument that there are literally zero alternatives to Google other than Bing, which does sadly turn out to be true.

(Bas van Leeuwen) #17

We see about 3% of our search-originated traffic from Bing (97% is Google, ~0% for Yandex and Baidu).

Several of our clients are probably unable to switch (big corporations have some byzantine rules).

(Stephen) #18

I’m far from a fan of Bing, and I’m hoping that’s a joke, because otherwise what you’re saying is that we’re now deciding to shut off entire demographics of audiences, just because they don’t install a new browser or swap the default search.

(Jeff Atwood) #19

Mostly our customers are cancelling their accounts due to excessive pageviews (essentially, Bing) that force them into higher paying hosting brackets.

(Stephen) #20

Sure, so the choices appear to be:

  • do nothing (always an option, but obviously impractical here)
  • limit it into oblivion while @sam talks to them, customers see an instant reduction in traffic and no potentially long-lasting impact to search
  • shut the door now, to hell with the consequences
  • block Bing for hosted customers only

Coming from an enterprise background the tendency would always veer towards the first and end up going down the second avenue.

Is there some other benefit to the third option, other than beating on Microsoft and Bing because it’s fun?

Was there no scope to block it for hosted customers only, via a plugin or otherwise? Looking at specialty hosting services for other products, many count only true pageviews and ignore crawlers. Is that conceivable for Discourse?
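
To illustrate what "count only true pageviews" might mean in practice, here is a minimal sketch that splits requests into crawler and human buckets by user agent. The token list and function names are assumptions for the example; a real classifier would need a much longer list and would still miss crawlers that do not identify themselves.

```python
from collections import Counter
from typing import Iterable

# Hypothetical substrings used to flag crawler user agents.
CRAWLER_TOKENS = ("bot", "crawler", "spider", "slurp")


def is_crawler(user_agent: str) -> bool:
    """Return True if the user agent contains any known crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def count_pageviews(user_agents: Iterable[str]) -> Counter:
    """Tally requests into 'crawler' and 'human' buckets by user agent."""
    counts = Counter(human=0, crawler=0)
    for ua in user_agents:
        counts["crawler" if is_crawler(ua) else "human"] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/61.0",
    ]
    print(count_pageviews(sample))
```

Only the "human" bucket would then count toward a billing tier.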

(Jeff Atwood) #21

It’s trivial to change this default; just edit one setting in your site settings, takes all of 15 seconds. Then you can watch the Bing carnage :boom: unfold on your site too :wink:

I am beating on Bing and Microsoft at the moment because they are behaving exceptionally badly to the point that it is costing us business, as I said above. Real money. This is not an abstract concern.