Google complaining – Indexed, though blocked by robots.txt

metakermit · September 5, 2018, 12:09pm

Google Search Console started sending me error reports about some Discourse pages.

URL is on Google, but has issues

It can appear in Google Search results (if not subject to a manual action or removal request). However, some issues prevent it from being eligible for all enhancements. Learn more (URL Inspection Tool - Search Console Help)

My forum’s welcome page is one of the pages that they complain about – it refers to user pages (e.g. this one) that are “Indexed, though blocked by robots.txt”.

I am using the normal Docker-based Discourse setup – it’s behind nginx for SSL purposes.

Any ideas what’s happening here? I’ve upgraded Discourse and set an additional test in the search console, but the problem still persists.

sam · September 6, 2018, 2:02am

This one is odd. @codinghorror, we saw the same thing with Bing on the user pages. Should we just add that magic noindex meta tag here to double ensure user pages do not sneak into indexes?

codinghorror · September 6, 2018, 3:26am

I guess, if it gets people to stop bugging us about it. It makes no difference in practice other than reducing the nagging.

riking · September 6, 2018, 5:39am

Or we could put useful content on the search engine / text browser view of user pages, like links to your top replies and top topics

sam · September 6, 2018, 5:41am

Hmmm… no… we purposely do not want to index user pages. Besides the spam factor, what kind of value does this provide?

riking · September 6, 2018, 5:44am

People on here keep asking for MORE SEO JUICE and INTERLINKING every so often and I honestly can’t think of anything better along those lines than linking to the most-liked posts + recent posts you’ve made.

Also, a forum user page is a valid search result for someone who wants to see your recent posts.

To put it another way: from Google’s perspective, internal links out of user pages are a pretty great page rank signal and this is them saying “hey it would be nice if you could let us use these”

sam · September 6, 2018, 5:47am

We did the “please index the user page” conversation so many times though… in fact, to be honest, I think even mentioning this automatically raises the blood temperature in Jeff’s head by 10 degrees.

Let’s keep this focused on just properly excluding the user page so we can stop dealing with the “oh, you excluded the user page but did not exclude it for realz” complaint.

codinghorror · September 6, 2018, 6:38am

Hmm we could make it so TL2 and higher get indexed user pages @sam. That would be consistent with nofollow rules in posts too.

Mittineague · September 6, 2018, 6:51am

It raised mine by more than a few mm Hg

As someone that has been a moderator for several years and seen countless examples of what I call “profile spam” I beg, please no.

To this day there are a large number of account registrations every day, the members reading no posts, making no posts, but populating their profile to advertise a business. And this with profiles that are only visible to other registered members. I can only imagine this would become even more of a problem than it already is if they became indexable.

True, there might be some value in exposing links to posts to search engines, but they are already exposed elsewhere and don’t need to be also exposed from profile pages.

EDIT
trust level 2 and above could work. By the time a member makes it to trust level 2 they can be, ermm, more trustworthy.

sam · September 6, 2018, 7:27am

I do like the idea of allowing this to TL2, agree it is consistent with the nofollow stuff.

codinghorror · September 6, 2018, 7:28am

If you are adding the per-page tag it is almost the same code anyway, and we’d need to remove the robots.txt user path block anyhow.
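
To make the idea concrete, here is a minimal sketch of the per-page decision, assuming the TL2 threshold suggested above. It is not Discourse's actual code; the function name `robots_meta_tag` and the constant `INDEXABLE_MIN_TRUST_LEVEL` are hypothetical and exist only for illustration.

```python
# Hypothetical sketch, not taken from the Discourse codebase.
# Assumed threshold: the "TL2 and higher" suggestion from this thread.
INDEXABLE_MIN_TRUST_LEVEL = 2


def robots_meta_tag(trust_level: int) -> str:
    """Return the robots meta tag to emit on a user page.

    An empty string means no tag is emitted, so the page may be indexed.
    """
    if trust_level >= INDEXABLE_MIN_TRUST_LEVEL:
        return ""
    # Below the threshold: ask crawlers not to index the page.
    return '<meta name="robots" content="noindex">'


if __name__ == "__main__":
    for level in range(5):
        print(level, repr(robots_meta_tag(level)))
```

For the tag to have any effect, crawlers must be able to fetch the page, which is why the robots.txt block on the user path would need to go.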

fefrei · September 6, 2018, 8:20am

Using the noindex tag is the right thing anyway, as far as I understand: The discussion is about whether we want Google to index the user pages, but robots.txt does not prevent indexing:

You should not use robots.txt as a means to hide your web pages from Google Search results.
This is because other pages might point to your page, and your page could get indexed that way, avoiding the robots.txt file.

I think this is what the message posted by @metakermit is hinting at: Google is complaining that it has indexed some pages it couldn’t retrieve, so the index entries are not really good entries.

Because of this, I think letting Google crawl these pages, with links to the users’ most relevant posts, is definitely a good thing: It helps search engines discover content and better understand which content is the most relevant.
Whether we want user pages to be listed in the index is a separate decision, and I agree that the two reasonable options are not at all or just for trusted users.
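
The crawl-versus-index distinction can be checked with Python's standard `urllib.robotparser`. This is a rough sketch: the `Disallow: /u/` rule, the `forum.example.com` URLs, and the helper name `can_crawl` are assumptions for illustration, not a copy of any real Discourse robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content for illustration: block the user path only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /u/
"""


def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if a robots.txt-compliant crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical URLs on a placeholder host.
    print(can_crawl(ROBOTS_TXT, "https://forum.example.com/u/alice"))
    print(can_crawl(ROBOTS_TXT, "https://forum.example.com/t/welcome/1"))
```

A crawler that is refused here never downloads the page, so it cannot see a noindex tag inside it. The URL can still end up in the index through links from other pages, which matches the "Indexed, though blocked by robots.txt" report.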

metakermit · September 6, 2018, 11:32am

Yeah, I don’t know if there’s any practical difference or not. I just think that it’s not a great experience if Discourse forum admins get emails from the Google Search Console that their site has failed validation (I assume others are experiencing this as well, since I didn’t make any special changes to my robots.txt).

davidkingham · September 6, 2018, 3:56pm

For my specific forum it would be great to have user pages indexed. It is a paid site, so we don’t have the spammer problem. We have a group of photographers whom we actually want to help promote their work. All of our users automatically go to TL2, so that would work for us. At least having the option to turn this on would be fantastic.

vincentp · October 10, 2018, 11:04pm

This just started happening on my Discourse install (self-hosted). It is reporting this for all the RSS links,
e.g.

<link rel="alternate" type="application/rss+xml" title="RSS feed of &#39;Is there a performance penalty in calling an external FB script?&#39;" href="https://forums.finalbuilder.com/t/is-there-a-performance-penalty-in-calling-an-external-fb-script/6238.rss" />

These links are on every topic page. Perhaps it might be better to just let Google index the RSS feeds?

sam · November 2, 2018, 5:40am

Can you see if this:

https://github.com/discourse/discourse/commit/d84256a876a9fa4fc7bcb4b8ac8c5865f8c10701

sorts out your problem?

vincentp · November 2, 2018, 1:03pm

Thanks, I have updated to the latest (beta3+199) and will request re-indexing… might take a week or so to know for sure if it’s sorted.

alexs · August 28, 2019, 4:18pm

I just wanted to check whether this change should mean that Search Console won’t issue warnings about “Indexed, though blocked by robots.txt” now? I ask because I have alerts for several TL2+ users, admin/users/list/new and, strangely, search?q=%7Bsearch_term_string%7D.

Is there any way to stop these alerts from being sent?

alicate · October 30, 2019, 10:47pm

I agree with Drazen, it’s annoying getting “Indexed, though blocked by robots.txt” emails from Google about user pages (/u/userName) when there apparently isn’t actually a problem.

codinghorror · October 31, 2019, 3:16am

Then update to latest beta, where this is addressed.
