Google Search Console started sending me error reports about some Discourse pages.
> URL is on Google, but has issues
> It can appear in Google Search results (if not subject to a manual action or removal request). However, some issues prevent it from being eligible for all enhancements. Learn more (URL Inspection Tool - Search Console Help)
My forum’s welcome page is one of the pages that they complain about – it refers to user pages (e.g. this one) that are “Indexed, though blocked by robots.txt”.
I am using the normal Docker-based Discourse setup – it’s behind nginx for SSL purposes.
Any ideas what’s happening here? I’ve upgraded Discourse and started another validation run in Search Console, but the problem still persists.
This one is odd, @codinghorror. We saw the same thing with Bing on the user pages. Should we just add that magic noindex meta tag here to make doubly sure user pages do not sneak into indexes?
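(For anyone following along, the “magic” tag in question is just the standard robots meta tag served in the page’s `<head>`. A minimal sketch:)

```html
<!-- Served in the <head> of each user page; tells crawlers that fetch
     the page not to add it to their search index. -->
<meta name="robots" content="noindex">
```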
People on here keep asking for MORE SEO JUICE and INTERLINKING every so often and I honestly can’t think of anything better along those lines than linking to the most-liked posts + recent posts you’ve made.
Also, a forum user page is a valid search result for someone who wants to see your recent posts.
To put it another way: from Google’s perspective, internal links out of user pages are a pretty strong PageRank signal, and this is them saying “hey, it would be nice if you could let us use these”.
We’ve had the “please index the user page” conversation so many times though… in fact, to be honest, I think even mentioning it automatically raises the blood temperature in Jeff’s head by 10 degrees.
Let’s keep this focused on just properly excluding the user page, so we can stop dealing with the “oh, you excluded the user page but did not exclude it for realz” complaint.
As someone who has been a moderator for several years and seen countless examples of what I call “profile spam”, I beg: please, no.
To this day there are a large number of account registrations every day from members who read no posts and make no posts, but populate their profile to advertise a business. And that is with profiles that are only visible to other registered members. I can only imagine this would become even more of a problem than it already is if profiles became indexable.
True, there might be some value in exposing links to posts to search engines, but those links are already exposed elsewhere and don’t also need to be exposed from profile pages.
EDIT: Trust level 2 and above could work. By the time a member makes it to trust level 2 they can be, ermm, more trustworthy.
Using the noindex tag is the right thing anyway, as far as I understand: the discussion is about whether we want Google to index the user pages, but robots.txt does not prevent indexing:
> You should not use robots.txt as a means to hide your web pages from Google Search results. This is because other pages might point to your page, and your page could get indexed that way, avoiding the robots.txt file.
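To make the distinction concrete: a robots.txt rule like the hypothetical excerpt below only stops compliant crawlers from fetching the page. If another site links to /u/someuser, Google can still index the bare URL without ever reading the page, which is exactly what produces the “Indexed, though blocked by robots.txt” warning.

```
# Hypothetical robots.txt excerpt: blocks crawling of user pages,
# but does NOT stop those URLs from being indexed via external links.
User-agent: *
Disallow: /u/

# To reliably keep a page out of the index, it must instead be
# crawlable and serve a noindex signal (meta tag or X-Robots-Tag header).
```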
I think this is what the message posted by @metakermit is hinting at: Google is complaining that it has indexed some pages it couldn’t retrieve, so the index entries are not really good entries.
Because of this, I think letting Google crawl these pages, with links to the users most relevant posts, is definitely a good thing: It helps search engines discover content and better understand which content is the most relevant.
Whether we want user pages to be listed in the index is a separate decision, and I agree that the two reasonable options are not at all or just for trusted users.
Yeah, I don’t know if there’s any practical difference or not. I just think it’s not a great experience for Discourse forum admins to get emails from Google Search Console saying their site has failed validation (I assume others are experiencing this as well, since I didn’t make any special changes to my robots.txt).
For my specific forum it would be great to have user pages indexed; it’s a paid site, so we don’t have the spammer problem. We have a group of photographers whose work we actually want to help promote. All of our users automatically go to TL2, so that would work for us. At the very least, having the option to turn this on would be fantastic.
This just started happening on my Discourse install (self-hosted). It’s reporting this for all the RSS links, e.g.:
<link rel="alternate" type="application/rss+xml" title="RSS feed of 'Is there a performance penalty in calling an external FB script?'" href="https://forums.finalbuilder.com/t/is-there-a-performance-penalty-in-calling-an-external-fb-script/6238.rss" />
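Since a feed is XML rather than HTML, a robots meta tag can’t be embedded in it; the usual technique for keeping the .rss URLs out of the index is an X-Robots-Tag response header. A minimal sketch for the nginx-in-front setup mentioned earlier (the location pattern and upstream name are illustrative placeholders, not Discourse’s actual config):

```nginx
# Hypothetical snippet for the nginx that fronts Discourse;
# "discourse" is a placeholder upstream name, adjust to your proxy setup.
location ~ \.rss$ {
    # Tell crawlers not to index feed URLs even though they are fetchable.
    add_header X-Robots-Tag "noindex" always;
    proxy_pass http://discourse;
}
```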
I just wanted to check: does this change mean that Search Console won’t issue “Indexed, though blocked by robots.txt” warnings now? I still have alerts for several TL2+ users, admin/users/list/new and, strangely, search?q=%7Bsearch_term_string%7D.
Is there any way to stop these alerts from being sent?
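For what it’s worth, that odd search?q=%7Bsearch_term_string%7D alert is presumably Google crawling the literal URL template from the Sitelinks Search Box structured data (%7B/%7D are the URL-encoded braces of the placeholder). The markup looks roughly like the sketch below, with example.com standing in for the real hostname; since /search is disallowed in robots.txt, fetching the placeholder URL trips the same warning.

```html
<!-- Sketch of Sitelinks Search Box JSON-LD; the hostname is illustrative.
     Google fetches the target template literally, braces URL-encoded,
     which hits the robots.txt rule for /search. -->
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "WebSite",
  "url": "https://example.com/",
  "potentialAction": {
    "@type": "SearchAction",
    "target": "https://example.com/search?q={search_term_string}",
    "query-input": "required name=search_term_string"
  }
}
</script>
```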
I agree with Drazen: it’s annoying getting “Indexed, though blocked by robots.txt” emails from Google about user pages (/u/userName) when there apparently isn’t actually a problem.