First things first: we want to salute you for the spirit and endurance you have put into conceiving and maintaining Discourse. It is always a pleasure for us and our community members to use.
We have summarized some minor observations made today and hope you will find them useful. Of course, we will also be happy to hear back about any misunderstandings on our side.
With kind regards,
Andreas.
Introduction
While investigating the behavior of our hosted Discourse instance at https://community.crate.io/ with respect to its robots.txt file [1], we discovered that Googlebot might not honor the settings as intended.
Evaluation
Let’s exercise this against the robots.txt on Meta [2], using my profile URL https://meta.discourse.org/u/amotl/.
The difference can quickly be spotted by comparing the results from this free robots.txt validator:
On this very topic, we believe we have already found the answer:
Thoughts
So, we are inclined to believe that Googlebot might currently ignore the rules defined within the User-agent: * section [3] completely, and only honor the rules within the User-agent: Googlebot section [4].
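As far as we understand, this matches how user-agent groups are specified to work: a crawler picks the most specific group matching its name and ignores all others, so any rule meant to also apply to Googlebot has to be repeated within its own section. A self-contained sketch using Python's standard-library robotparser, with a simplified robots.txt invented for illustration:

```python
from urllib import robotparser

# A simplified robots.txt, invented for illustration: the generic group
# disallows profile pages, while the Googlebot-specific group does not
# repeat that rule.
ROBOTS_TXT = """\
User-agent: *
Disallow: /u/

User-agent: Googlebot
Disallow: /admin/
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://meta.discourse.org/u/amotl/"

# A generic crawler falls back to the `*` group and is blocked ...
print(parser.can_fetch("SomeOtherBot", url))  # False

# ... while Googlebot matches its own group, which is the only group it
# reads, so `Disallow: /u/` does not apply to it.
print(parser.can_fetch("Googlebot", url))     # True
```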
If I understood right, the answer is not that easy. If a single user has sent a link using Gmail, Googlebot doesn’t honor robots.txt. The same thing happens if a link is shared somewhere (backlinks) where it looks to Googlebot like a normal everyday link.
Again — robots.txt is just a request.
There is one more “but”, though… Quite a few bots identify themselves as Googlebot, and their real identity can only be verified via the IP address.
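Google documents a verification procedure for exactly this reason, if I remember right: reverse-resolve the visiting IP, check that the resulting name belongs to googlebot.com or google.com, then forward-resolve it to confirm it maps back to the same IP. A rough sketch in Python (the sample address is only an illustration):

```python
import socket

def is_real_googlebot(ip: str) -> bool:
    """Check whether an IP claiming to be Googlebot really belongs to Google."""
    try:
        # Reverse DNS: resolve the IP to a host name.
        host, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # Forward-confirm: the host name must resolve back to the same IP.
        _, _, addresses = socket.gethostbyname_ex(host)
    except socket.gaierror:
        return False
    return ip in addresses

# Example call (illustrative address only):
print(is_real_googlebot("66.249.66.1"))
```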
I moved this to support; thank you for your delightfully written bug report here.
It has taken us a lot of time to finely tune our rules to keep Google happy. x-robots-tag: noindex is unevenly supported, but it is an industry standard. The issue with just banning crawling was that, for some reason, under certain conditions pages could find themselves in the Google index, and then there was no easy way to remove them because crawling was banned; a bit of a chicken-meets-egg problem.
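If you want to see the header, it is easy to inspect with a plain HTTP request. A minimal sketch using Python's standard library; whether a given page actually carries the header depends on the instance and its settings:

```python
import urllib.request

# Ask for the response headers of a page Discourse wants kept out of the
# index (the URL is the profile page discussed above; this is a sketch).
request = urllib.request.Request(
    "https://meta.discourse.org/u/amotl",
    method="HEAD",
    headers={"User-Agent": "header-check-example/1.0"},
)
with urllib.request.urlopen(request) as response:
    # Expected to print something like "noindex" if the header is set.
    print(response.headers.get("X-Robots-Tag"))
```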
Thank you very much for sharing more details about this matter. As always, I am amazed at the level of awesomeness you are pouring into every detail of Discourse.
I have to admit I haven’t been aware of this until now. So, thanks again!
Great, thanks. If this has become a widely accepted standard now, is there hope that some of the free robots.txt validators might also start honoring it in the future?
Are you aware of any which already implement corresponding additional header checks on top of reading the robots.txt today, like Google Search Console’s Inspect URL does? That would probably help people avoid the same confusion we ran into.