To be clear, this has nothing to do with being a discussion forum. It is related to the … interesting … way Google treats robots.txt. Per Robots.txt Introduction and Guide | Google Search Central | Documentation | Google for Developers:
A robotted page can still be indexed if linked to from other sites
While Google won’t crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL if it is linked from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the page can still appear in Google search results. To properly prevent your URL from appearing in Google Search results, you should password-protect the files on your server or use the noindex meta tag or response header (or remove the page entirely).
We’ve long included pages we don’t want indexed in the default robots.txt file each Discourse site has. This previously worked just fine. At some unknown point in the past this stopped being enough: Google decided to index pages linked from elsewhere even when they were disallowed via robots.txt.
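For context, a `Disallow` rule only tells crawlers not to fetch a path; it says nothing about indexing the URL itself. A minimal sketch of the kind of rules involved (the paths here are hypothetical placeholders, not Discourse’s actual defaults):

```
# robots.txt: Disallow stops crawling, not indexing.
# A URL under these paths can still be indexed if another site links to it.
User-agent: *
Disallow: /example-private/
Disallow: /example-admin/
```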
So earlier this year we started testing noindex headers on certain pages. This would work great, except that we now end up with a clash between robots.txt and the header. Per Block Search indexing with noindex | Google Search Central | Documentation | Google for Developers:
Important! For the noindex directive to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.
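To make the clash concrete: the noindex signal can be delivered either as a meta tag in the page or as an `X-Robots-Tag` response header, both of which Google documents. A minimal sketch of the header form (the response shown is illustrative, not our exact output):

```
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
X-Robots-Tag: noindex
```

The crawler only sees this header if it actually fetches the page, which is exactly what a robots.txt Disallow prevents.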
Which leads us to today. We’re testing the removal of certain pages from robots.txt. We have to be careful here: all of these changes are based on Google’s documentation, so we know we’re fine with Googlebot, but we also need to check other major crawlers to make sure we won’t cause issues there.
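As a rough way to sanity-check how the updated robots.txt reads for different user agents, Python’s standard-library robotparser can be pointed at the file; a sketch with placeholder URLs and agent names:

```python
from urllib import robotparser

# Load and parse a site's robots.txt.
rp = robotparser.RobotFileParser()
rp.set_url("https://forum.example.com/robots.txt")
rp.read()

# Check whether each crawler is allowed to fetch a given URL.
for agent in ["Googlebot", "bingbot", "DuckDuckBot"]:
    allowed = rp.can_fetch(agent, "https://forum.example.com/some-page")
    print(f"{agent}: {'may crawl' if allowed else 'blocked by robots.txt'}")
```

This only checks the robots.txt rules themselves; whether a given crawler then honors a noindex header the way Googlebot does still has to be confirmed against that crawler’s own documentation.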