To be clear, this has nothing to do with being a discussion forum. It is related to the … interesting … way Google treats robots.txt. Per Robots.txt Introduction and Guide | Google Search Central | Documentation | Google for Developers:
A robotted page can still be indexed if linked to from other sites
While Google won’t crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL if it is linked from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the page can still appear in Google search results. To properly prevent your URL from appearing in Google Search results, you should password-protect the files on your server or use the noindex meta tag or response header (or remove the page entirely).
We’ve long included pages we don’t want indexed in the default robots.txt file each Discourse site has. This previously worked just fine. At some unknown point in the past this stopped being enough: Google decided to index pages linked from elsewhere even when they were disallowed via robots.txt.
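For context, a `Disallow` rule only tells crawlers not to fetch a path; it says nothing about indexing the URL itself. A minimal sketch of the kind of rules involved (the paths here are hypothetical placeholders, not Discourse’s actual defaults):

```
# robots.txt: Disallow stops crawling, not indexing.
# A URL under these paths can still be indexed if another site links to it.
User-agent: *
Disallow: /example-private/
Disallow: /example-admin/
```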
So earlier this year we started testing noindex headers on certain pages. This would work great, except that we now end up with a clash between robots.txt and the header. Per Block Search indexing with noindex | Google Search Central | Documentation | Google for Developers:
Important! For the noindex directive to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.
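To make the clash concrete: the noindex signal can be delivered either as a meta tag in the page or as an `X-Robots-Tag` response header, both of which Google documents. A minimal sketch of the header form (the response shown is illustrative, not our exact output):

```
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
X-Robots-Tag: noindex
```

The crawler only sees this header if it actually fetches the page, which is exactly what a robots.txt Disallow prevents.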
Which leads us to today. We’re testing the removal of certain pages from robots.txt. We have to be careful here: all of these changes are based on Google’s documentation, so we know we’re fine with Googlebot, but we also need to check other major crawlers to make sure we won’t cause issues there.
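As a rough way to sanity-check how the updated robots.txt reads for different user agents, Python’s standard-library robotparser can be pointed at the file; a sketch with placeholder URLs and agent names:

```python
from urllib import robotparser

# Load and parse a site's robots.txt.
rp = robotparser.RobotFileParser()
rp.set_url("https://forum.example.com/robots.txt")
rp.read()

# Check whether each crawler is allowed to fetch a given URL.
for agent in ["Googlebot", "bingbot", "DuckDuckBot"]:
    allowed = rp.can_fetch(agent, "https://forum.example.com/some-page")
    print(f"{agent}: {'may crawl' if allowed else 'blocked by robots.txt'}")
```

This only checks the robots.txt rules themselves; whether a given crawler then honors a noindex header the way Googlebot does still has to be confirmed against that crawler’s own documentation.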