So there have been a number of conversations around the robots.txt file, and how some people need / don’t need to edit it based on their specific use case. What I haven’t seen is exactly where to edit it (knowing that after an upgrade one would need to re-edit the file). I see in /var/www/discourse/app/views there is a robots.txt directory with some Ruby files, but not an actual robots.txt file.
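Presumably it’s rendered at request time from those views rather than stored on disk, which would explain why there’s no file to edit. The served output is visible at the /robots.txt route, e.g. (replace forum.example.com with your own domain):

```sh
# View what Discourse is currently serving as robots.txt
curl -s https://forum.example.com/robots.txt
```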
Yep - those are valid settings. However, with the newest changes Google keeps making, we need to add /latest* to the Disallow list. These pages end up getting classified as “thin links” by Google and in turn negatively affect site ranking for search over time. For instance, this turns up a lot of records which are not valid.
It would benefit everyone if we could at least get /latest added to the robots.txt file.
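To make the ask concrete, something along these lines (illustrative only; the exact rules would need to match how Discourse generates its robots.txt):

```text
User-agent: *
# Keep crawlers off the "latest" topic lists and their filtered variants
Disallow: /latest
```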
If this is something that would be of benefit to many people, consider making a PR against core to add it either as a default, or behind a site setting.
I would be extremely against this, because how would Google then be able to find new content? Latest is very much required.
@codinghorror what I totally support here is adding noindex, follow to the meta tags for /latest and even category topic lists; it would clean up a bunch of messy results. Thoughts?
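To be concrete, that’s the standard robots meta tag, which keeps the page itself out of the index while still letting crawlers follow the topic links on it:

```html
<!-- Keep this page out of the index, but still crawl the links it contains -->
<meta name="robots" content="noindex, follow">
```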
I am not really following? What’s the problem? Our meta.discourse.org Google Search Console has been running for years, we check it every few months in quite some detail, and I’ve never seen any problems with this reported?
The problem is that we store “latest” content in the search index, which is pointless because we never want people to land on “latest”. The tricky thing is that any mucking around here can rock the boat, and this is not a boat we want to rock.
I have seen rare reports of this kind of stuff with user profiles not having noindex in their meta tags, but not the reverse.
I’m gonna need to see some Google Search Console screenshots as proof to believe this is a problem, because we’ve checked our console pretty closely for years.
Not sure why we’re operating on the assumption that user pages and other useful pages with healthy internal linking shouldn’t be indexed, when Google’s own forums (e.g. Google Product Forums) allow indexing of user profiles. Same with Reddit, Quora, and even Stack Overflow. I guess the biggest forum-like sites in the world, and Google themselves, must be bad at SEO.
Discourse is free and beggars can’t be choosers, so please don’t take this as complaining; I appreciate the work you’re all doing. I’m just attempting to give objective, helpful feedback.
A huge % of search engine queries are for people, including usernames. On Google Trends, most of the top searches are for people: 14 out of 20 at the time of posting this (70%).
Yes, most of these people are celebrities of some sort, and this is at the macro level. At the micro level there are still lots of searches for people, smaller influencers, bloggers, etc., and these are some of the most popular searches.
If I search for ‘{user} on {site_name}’ the most relevant result would be that user’s profile, where I can see all their posts. Or if I’m searching for a particular person because I enjoy their content on other platforms, it’d be nice to have the ability to find them on a Discourse forum.
Google usually knows how to sort results. Having profiles indexed makes it possible to surface them for the searches they are most relevant to.
The main thing, in my opinion: if not indexing profiles helps stop spammers because they get no link juice from non-indexed profiles, then by the same logic link juice won’t flow internally through our sites as efficiently either. Adding nofollow to all external links on profiles would achieve the same result.
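For reference, that would be the standard rel attribute on outbound profile links, which lets the profile stay indexed while passing no link equity to the external site (the URL below is just a placeholder):

```html
<!-- External link on a user profile: page stays indexable, no link equity passed -->
<a href="https://example.com/my-site" rel="nofollow">my website</a>
```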
You are completely hijacking a topic here that is asking for removal of items from the Google index.
Feel free to open a dedicated topic to discuss this. There are changes we do support, like having a reasonable HTML-only version of the user summary page… however, what you are doing is totally derailing this topic. We have survived fine for the last 5 years without indexing user pages, and lots of users DO NOT want their forum user page indexed by Google. Tons of profiles are thin and pointless. Turning this on, even just for TL2, can burn a lot of forums in Google because of the duplicate content.
Thanks for the response. I tried to make a feature topic, but it was locked and is going to be deleted. I didn’t think it would be considered a duplicate, as there do not appear to be other feature request threads specifically about this. There are several support posts asking how to edit the robots.txt file.
I searched for a relevant thread and thought that this was the second thread being referenced as the reason for closing the topic. My apologies. Having a way to edit robots.txt, as the title of this thread says, would solve indexing issues for everyone.
Ok, understood. Most of the arguments for excluding user profiles that I’ve seen are mainly about SEO, duplicate content, etc., which I don’t agree with. I didn’t see people not wanting their profiles indexed on a public forum listed as one of the primary reasons to exclude them. It is a valid reason, I just don’t think it was part of the decision-making process (at least publicly) when they were originally removed: Excluding user profiles in robots.txt (or allow edit of file)
I’ll stop posting in this thread now anyhow. Thank you.
The following explanation and supporting use case for why this could/should be editable came from our SEO Manager:
The issue here lies in the fact that this content is indexed and is seen as a) duplicate content, b) thin content, c) content that does not provide value, and d) a drag on page (and domain) authority growth.
There is no Google Search Console error, warning, or penalty in place yet (the “Thin content with low or no added value” manual penalty is the most common one). Utilizing other SEO-related tools, however, these issues do surface. If requested, I can absolutely provide screenshots of these reports; just let me know.
Moz Pro site crawl reports, for example, mark these Latest pages as having multiple content errors for a variety of reasons. The issues associated with these pages (based upon our domain reports, yet seemingly relevant to others) include: URL Too Long, Thin Content, Missing Canonical Tag, Overly Dynamic URL, Duplicate Titles and Descriptions, Missing H1, etc.
The question here now seems to be “remove or improve?”. One option is to add the noindex, nofollow meta tag values and also exclude these pages in robots.txt. Another option is to add canonical tags to these duplicates, referencing back to the main /latest page. A third option is to keep all of these pages indexed and instead focus on improving and optimizing them, so the content truly provides unique value and the issues noted above are corrected.
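To illustrate the canonicalization option, a minimal sketch (the URL is a placeholder, and the right canonical target depends on how the paginated /latest variants are generated):

```html
<!-- On a paginated or filtered "latest" variant, point search engines at the main listing -->
<link rel="canonical" href="https://forum.example.com/latest">
```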