Pages listed in the robots.txt are crawled and indexed by Google


#1

Pages meant to be hidden from Google are in the robots.txt

However, Google attempts to crawl them anyway.

Since they are accessible through links on web pages, they get indexed. The crawler then hits the robots.txt and the crawl comes back “blank”/“blocked”.

This results in a lot of errors in our analytics console and in wasted crawls.

The simple solution would be to use a noindex meta tag. Could there be a way to add this on a page-by-page basis?


(Jay Pfaffman) #2

Since you are running in a sub-folder, you’re on your own to generate the appropriate robots.txt file, because the one Discourse generates is at community/forum/robots.txt (though it seems that in your case robots.txt doesn’t matter if external sites link to a profile?).

You could also enable the “hide user profiles from public” site setting. That will “Disable user cards, user profiles and user directory for anonymous users.” which would keep Google away from them.


(Jeff Atwood) #3

My bad @rbrlortie, I didn’t realize this was a subfolder, so that is a different animal deserving of its own topic.

My response is pretty much what @pfaffman said, above ↑

Since Discourse does not control the top level of the website, Discourse has no control over robots.txt in this scenario. You’ll need to generate it yourself.


#4

Hi you two,

Thanks for the help. However, the robots.txt is not the problem.

I have one at my root, and it also lists the pages that Google is crawling.

https://www.robotshop.com/robots.txt

#discourse
Disallow: /community/forum/auth/
Disallow: /community/forum/assets/browser-update*.js
Disallow: /community/forum/users/
Disallow: /community/forum/u/
Disallow: /community/forum/my/
Disallow: /community/forum/badges/
Disallow: /community/forum/search
Disallow: /community/forum/search/
Disallow: /community/forum/tags
Disallow: /community/forum/tags/
Disallow: /community/forum/email/
Disallow: /community/forum/session
Disallow: /community/forum/session/
...

The issue is that those are not respected by crawlers. The only way to make sure Google isn’t indexing content is by adding the “noindex” meta tag.
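
To separate the two questions (is a URL blocked from crawling, versus is it kept out of the index), rules like the ones above can be checked locally. This is a minimal sketch using Python's standard `urllib.robotparser`. The `User-agent: *` line and the two sample URLs are assumptions added for illustration, since the excerpt does not show them. It only confirms whether crawling is disallowed and says nothing about indexing.

```python
from urllib.robotparser import RobotFileParser

# A few rules copied from the excerpt above. The "User-agent: *" line is an
# assumption: the excerpt does not show which agent the rules belong to.
ROBOTS_TXT = """\
User-agent: *
Disallow: /community/forum/users/
Disallow: /community/forum/u/
Disallow: /community/forum/badges/
Disallow: /community/forum/search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs, for illustration only.
profile_url = "https://www.robotshop.com/community/forum/u/example-user"
topic_url = "https://www.robotshop.com/community/forum/t/example-topic/1"

profile_allowed = parser.can_fetch("Googlebot", profile_url)
topic_allowed = parser.can_fetch("Googlebot", topic_url)

print("profile crawlable:", profile_allowed)
print("topic crawlable:", topic_allowed)
```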

See the official response by Google on their official YouTube channel: YouTube

“One thing maybe to keep in mind here is that if these pages are blocked by robots.txt, then it could theoretically happen that someone randomly links to one of these pages. And if they do that then it could happen that we index this URL without any content because it’s blocked by robots.txt. So we wouldn’t know that you don’t want to have these pages actually indexed.

Whereas if they’re not blocked by robots.txt you can put a noindex meta tag on those pages. And if anyone happens to link to them, and we happen to crawl that link and think “maybe there’s something useful here” then we would know that these pages don’t need to be indexed and we can just skip them from indexing completely.

So, in that regard, if you have anything on these pages that you don’t want to have indexed then don’t disallow them, use noindex instead.”
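
The reasoning in the quote can be written down as a small decision function. This is only a toy model of the quoted explanation, not Google's actual algorithm. The `Page` class, its field names, and the outcome strings are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    linked: bool             # some page links to this URL
    blocked_by_robots: bool  # matched by a Disallow rule in robots.txt
    has_noindex: bool        # carries <meta name="robots" content="noindex">


def index_outcome(page: Page) -> str:
    """Toy model of the quoted explanation; not a real search engine."""
    if not page.linked:
        # Nothing points at the URL, so the crawler never learns about it.
        return "not discovered"
    if page.blocked_by_robots:
        # The page cannot be fetched, so any noindex tag on it is never
        # seen. The bare URL can still end up in the index.
        return "indexed without content"
    if page.has_noindex:
        # The page is fetched, the tag is read, and the page is skipped.
        return "not indexed"
    return "indexed with content"


# Hypothetical profile URL that is both disallowed and tagged noindex.
profile = Page(
    url="/community/forum/u/example-user",
    linked=True,
    blocked_by_robots=True,
    has_noindex=True,
)
print(index_outcome(profile))
```

In this model the Disallow rule wins over the tag, because the tag is only visible once the page has been fetched.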

Since the default Discourse behavior is to attempt to hide those pages from crawlers, in my eyes the feature is broken.

The pages in the default Discourse robots.txt should have the <meta name="robots" content="noindex"> present.
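
As a rough sketch of what is being asked for here, the helper below returns the noindex tag for paths under the prefixes listed in the robots.txt excerpt. It is not Discourse's implementation. The function name `robots_meta_for` and the sample paths are hypothetical, and only a subset of the prefixes is included.

```python
# Path prefixes taken from the robots.txt excerpt earlier in the thread.
NOINDEX_PREFIXES = (
    "/community/forum/users/",
    "/community/forum/u/",
    "/community/forum/my/",
    "/community/forum/badges/",
    "/community/forum/search",
    "/community/forum/tags",
)

NOINDEX_META = '<meta name="robots" content="noindex">'


def robots_meta_for(path: str) -> str:
    """Return the noindex tag for hidden paths, or an empty string otherwise."""
    # str.startswith accepts a tuple and matches any of its prefixes.
    return NOINDEX_META if path.startswith(NOINDEX_PREFIXES) else ""


# Hypothetical paths, for illustration only.
print(repr(robots_meta_for("/community/forum/u/example-user")))
print(repr(robots_meta_for("/community/forum/t/example-topic/1")))
```

For the tag to be seen at all, the same paths would have to be crawlable, which is the point made in the quote above.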


(Jeff Atwood) #5

In our eyes, it is not. Feel free to submit a pull request changing the behavior if you want it changed.


#6

Apologies if my wording came across as harsher than it was meant to be.

What I mean is that pages in the robots.txt are getting crawled and are showing up publicly on Google.

This causes around one error in Google Analytics for each member every time Google crawls our domain.
image


#7

@pfaffman “hide user profiles from public” is indeed what we are using right now to stop the errors from flooding our analytics.


(Jeff Atwood) #8

This is an invalid search, it’s basically typing the URL in minus punctuation.

My working premise is that any real search terms would never lead you to this result. Hence the meaninglessness of the “error”.


(Stephen) #9

I think the concern is that as Google gives each site a finite amount of crawl time, people want noindex so that none of those resources are squandered crawling and erroring on pages they didn’t want indexed in the first place.

From that perspective alone it begins to make sense. I’ve seen how long it can take for Google to crawl and discover all the content on large, freshly indexed sites. It can be a matter of days for Google to crawl it all, and much, much longer before it figures out where it should be checking more frequently.


(Jeff Atwood) #10

The logic doesn’t make sense; robots.txt is specifically there to exclude content from webcrawlers. That is its entire purpose for existing.

After arriving at a website but before spidering it, the search crawler will look for a robots.txt file. If it finds one, the crawler will read that file first before continuing through the page. Because the robots.txt file contains information about how the search engine should crawl, the information found there will instruct further crawler action on this particular site. If the robots.txt file does not contain any directives that disallow a user-agent’s activity (or if the site doesn’t have a robots.txt file), it will proceed to crawl other information on the site.

What Google does is respect that (sort of) but crawl it anyway in a “just in case” manner. So I’d expect this “just in case” crawling is already lower priority than crawling, y’know, what isn’t explicitly excluded from webcrawlers…


(Jeff Atwood) #11

Huh, it looks like noindex may be supported in robots.txt already?

And it seems it works

Ultimately, the NoIndex directive in Robots.txt is pretty effective. It worked in 11 out of 12 cases we tested. It might work for your site, and because of how it’s implemented it gives you a path to prevent crawling of a page AND also have it removed from the index.

cc @sam this will be the easiest way.
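
For illustration, this is roughly what pairing `Disallow:` with `Noindex:` lines would look like when generating a robots.txt. It is a sketch under stated assumptions: the function name `robots_lines` is hypothetical, the two paths are taken from the excerpt earlier in the thread, and `Noindex:` was never a documented robots.txt directive (Google announced in 2019 that it would stop honoring it), so this shows the format under discussion rather than a recommended setup.

```python
def robots_lines(paths, user_agent="*"):
    """Build robots.txt text that pairs each Disallow with a Noindex line."""
    lines = [f"User-agent: {user_agent}"]
    for path in paths:
        lines.append(f"Disallow: {path}")
        # Unofficial directive discussed in this thread; not part of the
        # documented robots.txt rules.
        lines.append(f"Noindex: {path}")
    return "\n".join(lines) + "\n"


robots_txt = robots_lines(["/community/forum/u/", "/community/forum/users/"])
print(robots_txt, end="")
```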


#13

Seems like Robots.txt Noindex: was the piece of the puzzle that we were all missing.

If we end up using it, I’ll come back here with an update. So far we had “fixed” the issue by making user profiles hidden to unregistered users.

Thanks for the support!


(Jeff Atwood) #14

Feel free to reassign this if needed @sam so it gets done.


(Sam Saffron) #15

Completed per:


#16

Awesome! Google kept nagging everyone about it. Now it’s finally behind us.