Hi you two,
Thanks for the help. However, the robots.txt is not the problem.
I have it present on my root and it also contains the pages that google is crawling.
The issue is that robots.txt only controls crawling, not indexing. The only reliable way to keep Google from indexing a page is a “noindex” directive, and for that directive to work the page must not be disallowed in robots.txt, because a crawler that is blocked from the page never sees the tag.
See the official response from Google on their YouTube channel:
“One thing maybe to keep in mind here is that if these pages are blocked by robots.txt, then it could theoretically happen that someone randomly links to one of these pages. And if they do that then it could happen that we index this URL without any content because it’s blocked by robots.txt. So we wouldn’t know that you don’t want to have these pages actually indexed.
Whereas if they’re not blocked by robots.txt you can put a noindex meta tag on those pages. And if anyone happens to link to them, and we happen to crawl that link and think “maybe there’s something useful here” then we would know that these pages don’t need to be indexed and we can just skip them from indexing completely.
So, in that regard, if you have anything on these pages that you don’t want to have indexed then don’t disallow them, use noindex instead.”
Since the default Discourse behavior is to try to hide those pages from crawlers via robots.txt alone, in my eyes the feature is broken.
The pages listed in the default Discourse robots.txt should instead serve
<meta name="robots" content="noindex"> in their HTML head.
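For anyone who wants to check whether a given page actually serves the tag, here is a minimal sketch in Python (the helper name and the regex-based check are my own illustration, not anything Discourse ships; a real check should fetch the page and also handle reversed attribute order):

```python
import re

def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots noindex meta tag
    of the form <meta name="robots" content="...noindex...">."""
    pattern = re.compile(
        r'<meta\s+name=["\']robots["\']\s+content=["\'][^"\']*noindex[^"\']*["\']',
        re.IGNORECASE,
    )
    return bool(pattern.search(html))

page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(has_noindex(page))  # → True
```

You could run this against each URL in your robots.txt to confirm the directive is present before removing the Disallow rules.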