Generic rules in "robots.txt" not picked up by Googlebot

Dear Discourse Team,

first things first: we want to salute you for the spirit and endurance in conceiving and maintaining Discourse. It is always a pleasure for us and our community members to use.

We have summarized some minor observations made today and hope you will find them useful. On the other hand, we will also be happy to hear back about any misunderstandings on our side.

With kind regards,
Andreas.


Introduction

While investigating the behavior of our hosted Discourse instance at https://community.crate.io/ with respect to its robots.txt definition file [1], we discovered that Googlebot might not honor the settings as intended.

Evaluation

Let’s exercise this using the robots.txt on Meta [2] and my profile URL https://meta.discourse.org/u/amotl/.

The difference can quickly be spotted by comparing the outcomes from this free robots.txt validator:


[Screenshot: validator result when evaluating with Googlebot]

[Screenshot: validator result when evaluating with MSNBot]

Research

On this very topic, we believe we have already found the answer:

Thoughts

So, we are inclined to conclude that Googlebot currently ignores the rules defined within the User-agent: * section [3] completely, and only honors the rules within the User-agent: Googlebot section [4].
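
To make this group-selection behavior concrete, here is a minimal sketch using Python's standard-library robots.txt parser. The inline rules are a shortened excerpt of [3] and [4], not the complete file, and the parser only illustrates group selection; it does not replicate Google's wildcard handling.

    from urllib import robotparser

    # A shortened, illustrative excerpt of the two groups quoted in [3] and [4].
    ROBOTS_TXT = """\
    User-agent: *
    Disallow: /u

    User-agent: Googlebot
    Disallow: /email/
    """

    parser = robotparser.RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())

    # Googlebot matches its own group, which does not disallow /u ...
    print(parser.can_fetch("Googlebot", "https://meta.discourse.org/u/amotl"))  # True
    # ... while any other crawler falls back to the generic group and is blocked.
    print(parser.can_fetch("MSNBot", "https://meta.discourse.org/u/amotl"))     # False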


  1. https://community.crate.io/robots.txt ↩︎

  2. https://meta.discourse.org/robots.txt ↩︎

  3. robots.txt, section User-agent: *

    User-agent: *
    Disallow: /admin/
    Disallow: /auth/
    Disallow: /assets/browser-update*.js
    Disallow: /email/
    Disallow: /session
    Disallow: /user-api-key
    Disallow: /*?api_key*
    Disallow: /*?*api_key*
    Disallow: /badges
    Disallow: /u
    Disallow: /my
    Disallow: /search
    Disallow: /tag
    Disallow: /g
    Disallow: /t/*/*.rss
    Disallow: /c/*.rss
    
    ↩︎
  4. robots.txt, section User-agent: Googlebot

    User-agent: Googlebot
    Disallow: /auth/
    Disallow: /assets/browser-update*.js
    Disallow: /email/
    Disallow: /session
    Disallow: /*?api_key*
    Disallow: /*?*api_key*
    
    ↩︎

If I understood it right, the answer is not that easy. If a single user has sent a link using Gmail, Googlebot doesn’t honor robots.txt. The same thing happens if a link is shared somewhere (backlinks) where it looks to Googlebot like a normal everyday link.

Again — robots.txt is just a request.

There is another “but”, too: quite a lot of bots identify themselves as Googlebot, and the real one can be identified only via its IP address.
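
As a side note, that IP check can be done with the reverse/forward DNS lookup which Google documents for verifying Googlebot. A minimal sketch in Python; the address below is merely an example from a published Googlebot range:

    import socket

    def is_real_googlebot(ip: str) -> bool:
        # Reverse lookup: a genuine Googlebot address resolves to a host
        # under googlebot.com or google.com.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except OSError:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward confirmation: the hostname must resolve back to the same address.
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except OSError:
            return False

    print(is_real_googlebot("66.249.66.1"))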


This is correct and intentionally implemented this way.

Therefore, Googlebot receives an extra HTTP header X-Robots-Tag: noindex for pages which really should not be indexed. See:


For your own domains, you may use Google Search Console → Inspect URL.

Then try to add a user-profile URL to the index, e.g. https://www.example.com/u/jacob
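
For a quick check outside of Search Console, the header can also be read directly; a minimal sketch, assuming network access and using the profile URL from above as an example:

    import urllib.request

    def x_robots_tag(url: str) -> str | None:
        # Issue a HEAD request and return the X-Robots-Tag response header, if any.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request) as response:
            return response.headers.get("X-Robots-Tag")

    # A user profile page is expected to answer with "noindex".
    print(x_robots_tag("https://meta.discourse.org/u/amotl"))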


Dear Ayke,

Confirmed.

$ http https://meta.discourse.org/u/amotl --print hH | grep -i robot
X-Robots-Tag: noindex

Thank you very much for your quick response and explanation and for referencing the corresponding patch.

With kind regards,
Andreas.


I moved this to #support, thank you for your delightfully written bug report here.

It has taken us so much time to fine-tune our rules to keep Google happy. X-Robots-Tag: noindex is unevenly supported, but it is an industry standard. The issue with just banning crawling was that, for some reason and under certain conditions, pages could find themselves in the Google index, and then there was no easy way to remove them because crawling was banned; a bit of a chicken-meets-egg problem.
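
To illustrate the pattern in a few lines (a hypothetical sketch, not Discourse’s actual implementation; Flask is used only as an example framework): the profile path stays crawlable, but every response carries X-Robots-Tag: noindex, so Google can still fetch a stray page and then drop it from its index.

    from flask import Flask, make_response

    app = Flask(__name__)

    @app.route("/u/<username>")
    def user_profile(username):
        # The page is served normally, so it remains reachable for crawlers ...
        response = make_response(f"Profile of {username}")
        # ... but indexing is explicitly refused via the response header.
        response.headers["X-Robots-Tag"] = "noindex"
        return response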


Dear Sam,

thank you very much for sharing more details about this matter. As always, I am amazed by the level of awesomeness you are pouring into every detail of Discourse.

I have to admit I haven’t been aware of this until now. So, thanks again!

Great. Thanks. If this has become a widely accepted standard now, is there hope that some of the free robots.txt validators might also start honoring it in the future?

Are you aware of any which already implement corresponding additional header checks on top of reading the robots.txt today, like Google Search Console’s Inspect URL does? That would probably help people avoid the same confusion we ran into.

With kind regards,
Andreas.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.