Needing to edit robots.txt file - where is it?

So there have been a number of conversations around the robots.txt file, and how some need / don’t need to edit it based on their specific use-case. What I haven’t seen is exactly where to edit it (knowing that after an upgrade one would need to re-edit the file again). I see in /var/www/discourse/app/views there is a robots.txt directory with some ruby files but not an actual robots.txt file.

So where exactly does this file reside?

There's a robots_txt_controller.rb which returns some text which ends up in the robots.txt file when served.

It mostly runs on site settings, which you should do your best to use over modifying the file directly.

Those settings are:

  • whitelisted_crawler_user_agents
  • blacklisted_crawler_user_agents
  • slow_down_crawler_rate

Yep - those are valid settings. However, with the newest changes Google keeps making, we need to add /latest* to the Disallow list. These pages end up getting classified as “thin links” by Google and in turn negatively affect site ranking for search over time. For instance, this turns up a lot of records which are not valid

It would benefit everyone if we at least could get Latest added to the robots.txt file
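
To make the proposed rule concrete, here is a rough sketch of how a crawler might read it, using Python's standard `urllib.robotparser`. The robots.txt body below is hypothetical (it is the rule being asked for, not what Discourse serves), and it uses `Disallow: /latest` rather than `/latest*` because this parser does plain prefix matching and does not interpret wildcards.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body: the rule proposed in this post,
# not the output of Discourse's robots_txt_controller.rb.
ROBOTS_TXT = """\
User-agent: *
Disallow: /latest
"""


def is_allowed(path, user_agent="Googlebot"):
    """Return True if the hypothetical rules let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


for path in ("/latest", "/latest?page=2", "/t/some-topic/123"):
    print(path, "allowed" if is_allowed(path) else "blocked")
```

Under these assumptions the two /latest URLs are blocked while topic URLs stay crawlable. Note that a blocked page cannot be crawled at all, so links on it are not followed either.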

I just searched for Google documentation regarding “thin links” but came up empty. Got a link to the Google documentation handy?

Sorry, I meant thin content

And you’ve gotten “Thin content with little or no added value” message(s) on your Manual Actions Report page for the “latest” page?

Anyway, getting a bit off-topic, sorry.

You can write a plugin to add rules to the robots.txt file

If this is something that would be of benefit to many people, consider making a PR against core to add it either as a default, or behind a site setting.

I would be extremely against this, because how would Google then be able to find new content? Latest is very much required.

@codinghorror what I totally support here is adding noindex, follow to meta tags for /latest and even category topic lists, it will clean up a bunch of messy results. Thoughts?
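
For anyone unsure what "noindex, follow" would look like in practice, here is a minimal sketch. The function name and the path prefixes are assumptions made for illustration; this is not how Discourse decides which meta tags to render.

```python
# Hypothetical path prefixes for topic-list pages (/latest and category
# lists); adjust these for a real site.
LIST_PATH_PREFIXES = ("/latest", "/c/")


def robots_meta_tag(path):
    """Return a 'noindex, follow' meta tag for list pages, else an empty string."""
    if path.startswith(LIST_PATH_PREFIXES):
        return '<meta name="robots" content="noindex, follow">'
    return ""


print(robots_meta_tag("/latest"))
print(repr(robots_meta_tag("/t/some-topic/123")))
```

Unlike a robots.txt Disallow, this still lets crawlers fetch the list pages and follow their links to topics; it only asks them to keep the list pages themselves out of the index.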

I am not really following? What’s the problem? Our meta.discourse.org Google Search Console has been running for years, we check it every few months in quite some detail, and I’ve never seen any problems with this reported?

The problem is that we store “latest” content in the search index, which is pointless because we never want people to land on “latest”. The tricky thing is that any mucking around here can rock the boat and this is not a boat we want to rock.

I have seen rare reports of this kind of stuff with user profiles not having noindex in meta tags, but not the reverse.

I’m gonna need to see some Google Search Console screenshots as proof to believe this is a problem, because we’ve checked our console pretty closely for years.

As requested, here are screenshots from one site:

This is completely unrelated and we should probably add noindex tags on user pages. This was raised before wrt Bing.

Not sure why we’re operating on the assumption that user pages and other useful pages that have healthy internal linking shouldn’t be indexed when Google’s own forums (e.g. Google Product Forums) allow indexing of user profiles. Same with Reddit, Quora and even Stack Overflow :wink: I guess the biggest forum-like sites in the world and Google themselves must be bad at SEO.

Discourse is free and beggars can’t be choosers, please don’t take this as complaining and I appreciate the work you’re all doing. I’m just attempting to give objective helpful feedback.

On what planet would you want a web search to end up on a user page in a discussion forum? That’s beyond useless.

A huge % of search engine queries are for people, including usernames. On Google Trends, most of the top searches are for people, 14/20 at the time of posting this (70%). Google Trends

Yes, most of these people are celebrities of some sort and this is at macro level. At micro level, there are still lots of searches for people, smaller influencers, bloggers etc., and these are some of the most popular searches.

If I search for ‘{user} on {site_name}’ the most relevant result would be that user’s profile, where I can see all their posts. Or if I’m searching for a particular person because I enjoy their content on other platforms, it’d be nice to have the ability to find them on a Discourse forum.

Google knows how to sort results, usually. Having the profiles indexed gives the ability to show the profiles for the searches that they are most relevant for.

The main thing, in my opinion, is this: if not indexing profiles helps stop spammers because they get no link juice from non-indexed profiles, then by the same logic link juice won’t flow internally through our sites as efficiently. Nofollow tags for all profile links to external URLs would achieve the same result.

You are completely hijacking a topic here that is asking for removal of items from the Google index.

Feel free to open a dedicated topic to discuss this. There are changes we do support, like having a reasonable HTML off version of the user summary page… however, what you are doing is totally derailing this topic. We have survived fine for the last 5 years without indexing user pages, and lots of users DO NOT want their forum user page indexed by Google. Tons of profiles are thin and pointless. Turning this on, even just for TL2, can burn a lot of forums in Google because of the dupe content.

You can see previous discussions here:

https://meta.discourse.org/t/seo-compared-to-other-well-known-tools/3914

Hi Sam,

Thanks for the response. I tried to make a #feature topic but it was locked and going to be deleted, I didn’t think it would be considered a duplicate as there do not appear to be other feature request threads specifically about this. There are several support posts asking how to edit the robots.txt file.

Here is the post I made specifically for this:
https://meta.discourse.org/t/robots-txt-remove-blocking-user-profiles-or-allow-editing/94082

I searched for a relevant thread and thought that this was the second thread being referenced as the reason for closing the topic. My apologies. Having a way to edit the robots.txt as the title of this thread says, would solve indexing issues for everyone :slight_smile:

Ok understood, most of the arguments for excluding the user profiles I’ve seen are mainly for SEO, duplicate content etc. which I don’t agree with. I didn’t see people not wanting their profiles indexed on a public forum as one of the primary reasons to exclude them, it is a valid reason, just I don’t think it was part of the decision making process (at least publicly) when they were originally removed: Excluding user profiles in robots.txt (or allow edit of file)

I’ll stop posting in this thread now anyhow. Thank you.

The following came as an explanation and supporting use case for why this could/should be edited from our SEO Manager:

The issue here is that this content is indexed and is seen as a) duplicate content and b) thin content, c) does not provide value, and d) holds back increases in page (and domain) authority.

There are no Google Search Console errors, warnings, or penalties in place yet (“Thin content with low or no added value.” is the most common manual penalty). Other SEO-related tools do surface this, however. If requested, I can absolutely provide screenshots of these reports, just let me know.

Moz Pro site crawl reports, for example, mark these Latest pages as having multiple content errors for a variety of reasons. The issues associated with these pages include (based upon our domain reports, yet seemingly relevant to others): URL Too Long, Thin Content, Missing Canonical Tag, Overly Dynamic URL, Duplicate Titles and Descriptions, Missing H1, etc.

The question here now seems to be “remove or improve?”. Yes one option is to simply add the noindex, nofollow meta tag values and include this in robots.txt and/or yes, another option is to simply add canonicalization to these duplicates, referencing back to the main /latest page. Also, yes, another option is to keep all of these pages indexed, perhaps then focusing on improvements and optimizations to these pages so the content truly does provide unique value and addresses correcting those issues noted above.
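
As a rough illustration of the canonicalization option, the sketch below maps any /latest variant back to the main /latest URL. The helper name and the forum.example.com host are made up for the example; this is not an existing Discourse feature.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_link(url, list_path="/latest"):
    """Return a canonical <link> tag for variants of list_path, else ''.

    Hypothetical helper: drops the query string and fragment so every
    variant of the list page points at one main URL.
    """
    parts = urlsplit(url)
    if parts.path.rstrip("/") != list_path:
        return ""
    canonical = urlunsplit((parts.scheme, parts.netloc, list_path, "", ""))
    return f'<link rel="canonical" href="{canonical}">'


print(canonical_link("https://forum.example.com/latest?page=3&order=views"))
```

This keeps the pages crawlable and indexable while telling search engines which single URL the duplicates should be credited to.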

For reference here is information from Google around this

This statement makes zero sense to me, “latest” is the default homepage for Discourse so we absolutely do want people to land there.