Pages listed in the robots.txt are crawled and indexed by Google

rbrlortie · October 23, 2018, 7:23pm

Pages meant to be hidden from Google are listed in the robots.txt.

However, Google attempts to crawl them anyway.

Since they are accessible through links on web pages, they get indexed. The crawler then hits the robots.txt and crawls them as “blank”/“blocked”.

This results in a lot of errors in our analytics console and wasted crawls.

The simple solution would be to use a noindex meta tag. Could there be a way to add this on a page by page basis?

pfaffman · October 23, 2018, 8:40pm

Since you are running in a sub-folder, you’re on your own to generate the appropriate robots.txt file, because the one Discourse generates ends up at community/forum/robots.txt (though it seems your point is that robots.txt doesn’t matter if external sites link to a profile?).

You could also enable the hide user profiles from public site setting. That will “Disable user cards, user profiles and user directory for anonymous users.” which would keep Google away from them.

codinghorror · October 23, 2018, 9:11pm

My bad @rbrlortie, I didn’t realize this was a subfolder install, so that is a different animal deserving of its own topic.

My response is pretty much what @pfaffman said, above ↑

Since Discourse does not control the top level of the website, Discourse has no control over robots.txt in this scenario. You’ll need to generate it yourself.

rbrlortie · October 23, 2018, 9:24pm

Hi you two,

Thanks for the help. However, the robots.txt is not the problem.

I have it present at my root, and it also lists the pages that Google is crawling.

https://www.robotshop.com/robots.txt

#discourse
Disallow: /community/forum/auth/
Disallow: /community/forum/assets/browser-update*.js
Disallow: /community/forum/users/
Disallow: /community/forum/u/
Disallow: /community/forum/my/
Disallow: /community/forum/badges/
Disallow: /community/forum/search
Disallow: /community/forum/search/
Disallow: /community/forum/tags
Disallow: /community/forum/tags/
Disallow: /community/forum/email/
Disallow: /community/forum/session
Disallow: /community/forum/session/
...

The issue is that those are not respected by crawlers. The only way to make sure Google isn’t indexing content is by adding the “noindex” meta tag.

See the official response by Google on their official YouTube channel: https://www.youtube.com/watch?v=rx9WslZH2ag

“One thing maybe to keep in mind here is that if these pages are blocked by robots.txt, then it could theoretically happen that someone randomly links to one of these pages. And if they do that then it could happen that we index this URL without any content because it’s blocked by robots.txt. So we wouldn’t know that you don’t want to have these pages actually indexed.

Whereas if they’re not blocked by robots.txt you can put a noindex meta tag on those pages. And if anyone happens to link to them, and we happen to crawl that link and think “maybe there’s something useful here” then we would know that these pages don’t need to be indexed and we can just skip them from indexing completely.

So, in that regard, if you have anything on these pages that you don’t want to have indexed then don’t disallow them, use noindex instead.”
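The three cases in that quote can be written out as a small decision function. This is only a simplified model of the behaviour described in the quote, not Google's actual algorithm; the names `Outcome` and `indexing_outcome` are made up for illustration.

```python
from enum import Enum


class Outcome(Enum):
    NOT_INDEXED = "not indexed"
    URL_ONLY = "indexed as a bare URL, without content"
    INDEXED = "indexed with content"


def indexing_outcome(blocked_by_robots: bool,
                     has_noindex: bool,
                     discovered_via_link: bool) -> Outcome:
    """Simplified model of the quote above (hypothetical, for illustration)."""
    if not discovered_via_link:
        # Assumption: a page nobody links to is never discovered at all.
        return Outcome.NOT_INDEXED
    if blocked_by_robots:
        # The crawler may not fetch the page, so it never sees a noindex
        # tag; the URL itself can still end up in the index.
        return Outcome.URL_ONLY
    if has_noindex:
        # The page is crawlable, so the noindex tag is read and honoured.
        return Outcome.NOT_INDEXED
    return Outcome.INDEXED


# Blocked in robots.txt AND tagged noindex: the tag is never seen.
print(indexing_outcome(True, True, True).value)
# Crawlable and tagged noindex: skipped from indexing.
print(indexing_outcome(False, True, True).value)
```

The point of the model is the first call: combining Disallow with noindex gives the worse result, because the Disallow hides the noindex from the crawler.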

Since the default Discourse behavior is to attempt to hide those pages from crawlers, in my eyes the feature is broken.

The pages in the default Discourse robots.txt should have the <meta name="robots" content="noindex"> present.
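To check whether a given page already carries that tag, something like the following could be used. It is a minimal sketch using only the Python standard library; `RobotsMetaParser` and `has_noindex` are hypothetical helper names, and it only looks at the meta tag, not at an `X-Robots-Tag` HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None for valueless attributes.
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(has_noindex(page))
```

Running this against the saved HTML of a profile page would show whether the tag is present in what the server actually sends.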

codinghorror · October 23, 2018, 9:29pm

In our eyes, it is not. Feel free to submit a pull request changing the behavior if you want it changed.

rbrlortie · October 23, 2018, 9:40pm

Apologies if my wording came across as harsher than it was meant to be.

What I mean is that pages listed in the robots.txt are getting crawled and are showing up publicly on Google.

This causes around one error in Google Analytics for each member every time Google crawls our domain.

rbrlortie · October 23, 2018, 9:56pm

@pfaffman hide user profiles from public is indeed what we are using right now to stop the errors from flooding in our analytics.

codinghorror · October 23, 2018, 10:09pm

This is an invalid search, it’s basically typing the URL in minus punctuation.

My working premise is that any real search terms would never lead you to this result. Hence the meaninglessness of the “error”.

Stephen · October 24, 2018, 8:13am

I think the concern is that as Google gives each site a finite amount of crawl time, people want noindex so that none of those resources are squandered crawling and erroring on pages they didn’t want indexed in the first place.

From that perspective alone it begins to make sense. I’ve seen how long it can take for Google to crawl and discover all the content on large, freshly indexed sites. It can be a matter of days for Google to crawl it all, and much, much longer before it figures out where it should be checking more frequently.

codinghorror · October 24, 2018, 10:20am

The logic doesn’t make sense; robots.txt is specifically there to exclude content from webcrawlers. That is its entire purpose for existing.

After arriving at a website but before spidering it, the search crawler will look for a robots.txt file. If it finds one, the crawler will read that file first before continuing through the page. Because the robots.txt file contains information about how the search engine should crawl, the information found there will instruct further crawler action on this particular site. If the robots.txt file does not contain any directives that disallow a user-agent’s activity (or if the site doesn’t have a robots.txt file), it will proceed to crawl other information on the site.

What Google does is respect that (sort of) but crawl it anyway in a “just in case” manner. So I’d expect this “just in case” crawling is already lower priority than crawling, y’know, what isn’t explicitly excluded from webcrawlers…
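The "read robots.txt first, then decide what to fetch" step from the quoted description can be tried with Python's standard `urllib.robotparser`. This is a rough sketch: the Disallow rules are copied from the excerpt earlier in the thread, the `User-agent: *` line is an assumption (the excerpt does not show which user-agent group the rules sit in), and the example paths and the `may_crawl` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt excerpt earlier in the thread.
# "User-agent: *" is an assumption; the excerpt does not show the group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /community/forum/users/
Disallow: /community/forum/u/
Disallow: /community/forum/search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "https://www.robotshop.com"


def may_crawl(path: str, agent: str = "Googlebot") -> bool:
    """True if the parsed rules allow `agent` to fetch BASE + path."""
    return parser.can_fetch(agent, BASE + path)


# Hypothetical example paths, not taken from the real site.
for path in ("/community/forum/u/some-user",
             "/community/forum/t/some-topic/123",
             "/community/forum/search?q=robot"):
    print(path, "->", "crawl" if may_crawl(path) else "skip")
```

Note that this only answers "may I fetch this URL?". It says nothing about whether the URL may appear in an index, which is the gap discussed above.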

codinghorror · October 24, 2018, 10:28am

Huh, it looks like noindex may be supported in robots.txt already?

And it seems it works

Ultimately, the NoIndex directive in Robots.txt is pretty effective. It worked in 11 out of 12 cases we tested. It might work for your site, and because of how it’s implemented it gives you a path to prevent crawling of a page AND also have it removed from the index.

cc @sam this will be the easiest way.

rbrlortie · October 24, 2018, 1:54pm

Seems like Robots.txt Noindex: was the piece of the puzzle that we were all missing.

If we end up using it, I’ll come back here with an update. So far we have “fixed” the issue by hiding user profiles from unregistered users.

Thanks for the support!

codinghorror · October 29, 2018, 6:12pm

Feel free to reassign this if needed @sam so it gets done.

sam · November 2, 2018, 5:40am

Completed per:

https://github.com/discourse/discourse/commit/d84256a876a9fa4fc7bcb4b8ac8c5865f8c10701

rbrlortie · November 2, 2018, 5:36pm

Awesome! Google kept nagging everyone about it. Now it’s finally behind us.

system · December 2, 2018, 5:36pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.

codinghorror · July 30, 2019, 1:32am

Yet again, craptacular SEO “experts” have ed up the joint.

Google confirmed this was never supported and never actually worked. We need to revert this work @sam. I’ll delete all the other dupe topics on this.
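For anyone who added such rules by hand in the meantime, a quick way to find leftover `Noindex:` lines in a robots.txt might look like this. It is a minimal sketch; `find_noindex_rules` and the sample file contents are hypothetical.

```python
import re

# Matches a line that starts (after optional whitespace) with "Noindex:".
# Lines commented out with "#" do not match.
_NOINDEX_RE = re.compile(r"^\s*noindex\s*:", re.IGNORECASE)


def find_noindex_rules(robots_txt: str) -> list:
    """Return (line_number, line) pairs for every Noindex: rule."""
    return [
        (number, line.strip())
        for number, line in enumerate(robots_txt.splitlines(), start=1)
        if _NOINDEX_RE.match(line)
    ]


# Hypothetical sample file for illustration.
SAMPLE = """\
User-agent: *
Disallow: /community/forum/u/
Noindex: /community/forum/u/
# Noindex: commented out
"""

for number, line in find_noindex_rules(SAMPLE):
    print(f"line {number}: {line}")
```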

sam · July 30, 2019, 1:34am

Reverted per:

https://github.com/discourse/discourse/commit/5feb342914d30de72d19d97900fb58e5447d712a

codinghorror · July 30, 2019, 1:36am

We should probably backport that as well.

sam · July 30, 2019, 1:37am

Sure … done, this is now backported
