Need to edit the robots.txt file - where is it?

So there have been a number of conversations around the robots.txt file, and how some need / don’t need to edit it based on their specific use-case. What I haven’t seen is exactly where to edit it (knowing that after an upgrade one would need to re-edit the file). I see that in /var/www/discourse/app/views there is a robots.txt directory with some Ruby files, but not an actual robots.txt file.

So where exactly does this file reside?

There’s a robots_txt_controller.rb which returns some text which ends up in the robots.txt file when served.

It mostly runs on site settings, which you should do your best to use over modifying the file directly.

Those settings being

  • whitelisted_crawler_user_agents
  • blacklisted_crawler_user_agents
  • slow_down_crawler_rate

Yep - those are valid settings. However, with the newest changes Google keeps making, we need to add /latest* to the Disallow list. These end up getting classified as “thin links” by Google and in turn negatively affect site ranking for search over time. For instance this turns up a lot of records which are not valid
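To illustrate what such a rule would do to a crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules text and the `forum.example.com` URLs are made up for illustration and are not what Discourse actually serves. This parser does plain prefix matching, so `Disallow: /latest` stands in for `/latest*` here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only, not Discourse's real robots.txt.
RULES = """\
User-agent: *
Disallow: /latest
"""


def is_allowed(url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` under RULES."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "https://forum.example.com/latest",
        "https://forum.example.com/latest?page=2",
        "https://forum.example.com/t/some-topic/123",
    ):
        print(url, is_allowed(url))
```

Keep in mind that a Disallow rule stops compliant crawlers from fetching the page at all, so they also will not follow the links on it.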

It would benefit everyone if we at least could get Latest added to the robots.txt file

I just searched for Google documentation regarding “thin links” but came up empty. Got a link to the Google documentation handy?

Sorry, I meant thin content

And you’ve gotten “Thin content with little or no added value” message(s) on your Manual Actions Report page for the “latest” page?

Anyway, getting a bit off-topic, sorry.

You can write a plugin to add rules to the robots.txt file

https://meta.discourse.org/t/how-to-add-to-robots-txt-host/59718/4?u=mittineague

If this is something that would be of benefit to many people, consider making a PR against core to add it either as a default, or behind a site setting.

I would be extremely against this, because how would Google then be able to find new content? Latest is very much required.

@codinghorror what I totally support here is adding noindex, follow to meta tags for /latest and even category topic lists, it will clean up a bunch of messy results. Thoughts?
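As a rough sketch of the noindex, follow idea: the page stays crawlable so its links are still followed, but the page itself is kept out of the index. The function name and the path prefixes below (`/latest` for the latest list, `/c/` for category topic lists) are assumptions for illustration, not Discourse's actual implementation.

```python
from urllib.parse import urlsplit

# Hypothetical list-page prefixes; the real Discourse routes may differ.
LIST_PREFIXES = ("/latest", "/c/")

NOINDEX_FOLLOW = '<meta name="robots" content="noindex, follow">'


def robots_meta(url: str) -> str:
    """Return a robots meta tag for list pages, or '' for any other page."""
    path = urlsplit(url).path
    if path.startswith(LIST_PREFIXES):
        return NOINDEX_FOLLOW
    return ""


if __name__ == "__main__":
    print(robots_meta("https://forum.example.com/latest?page=2"))
    print(repr(robots_meta("https://forum.example.com/t/some-topic/123")))
```

Unlike a robots.txt Disallow, this only works if the crawler is allowed to fetch the page, since it has to read the tag.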

I am not really following? What’s the problem? Our meta.discourse.org Google Search Console has been running for years, we check it every few months in quite some detail, and I’ve never seen any problems with this reported?

The problem is that we store “latest” content in the search index, which is pointless because we never want people to land on “latest”. The tricky thing is that any mucking around here can rock the boat, and this is not a boat we want to rock.

I have seen rare reports of this kind of stuff with user profiles not having noindex in meta tags, but not the reverse.

I’m gonna need to see some Google Search Console screenshots as proof to believe this is a problem, because we’ve checked our console pretty closely for years.

As requested, here are screenshots from one site:

This is completely unrelated and we should probably add noindex tags on user pages. This was raised before wrt Bing.

Not sure why we’re operating on the assumption that user pages and other useful pages that have healthy internal linking shouldn’t be indexed when Google’s own forums (e.g. Google Product Forums) allow indexing of user profiles. Same with Reddit, Quora and even Stack Overflow :wink: I guess the biggest forum-like sites in the world and Google themselves must be bad at SEO.

Discourse is free and beggars can’t be choosers, please don’t take this as complaining and I appreciate the work you’re all doing. I’m just attempting to give objective helpful feedback.

On what planet would you want a web search to end up on a user page in a discussion forum? That’s beyond useless.

A huge % of search engine queries are for people, including usernames. On Google Trends, most of the top searches are for people, 14/20 at the time of posting this (70%). https://trends.google.com/trends/trendingsearches/daily?geo=US

Yes, most of these people are celebrities of some sort and this is at the macro level. At the micro level, there are still lots of searches for people, smaller influencers, bloggers etc., and these are some of the most popular searches.

If I search for ‘{user} on {site_name}’ the most relevant result would be that user’s profile, where I can see all their posts. Or if I’m searching for a particular person because I enjoy their content on other platforms, it’d be nice to have the ability to find them on a Discourse forum.

Google knows how to sort results, usually. Having the profiles indexed gives the ability to show the profiles for the searches that they are most relevant for.

The main thing in my opinion, is if not indexing profiles helps stop spammers due to not getting link juice from non-indexed profiles - using the same logic, link juice won’t flow internally through our sites as efficiently. Nofollow tags for all profile links to external URLs would achieve the same result.

You are completely hijacking a topic here that is asking for removal of items from the Google index.

Feel free to open a dedicated topic to discuss this. There are changes we do support, like having a reasonable HTML off version of the user summary page… however, what you are doing is totally derailing this topic. We have survived fine for the last 5 years without indexing user pages, and lots of users DO NOT want their forum user page indexed by Google. Tons of profiles are thin and pointless. Turning this on, even just for TL2, can burn a lot of forums in Google because of the dupe content.

You can see previous discussions here:

https://meta.discourse.org/t/seo-compared-to-other-well-known-tools/3914

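Hi Sam,

Thanks for the reply. I tried to create a Contribute > Feature topic, but it was locked and is going to be deleted. I didn’t expect it to be treated as a duplicate, since there doesn’t seem to be any other request thread dedicated to this feature. There are currently several support posts asking how to edit the robots.txt file.

Here is the post I made specifically for this:
https://meta.discourse.org/t/robots-txt-remove-blocking-user-profiles-or-allow-editing/94082

I searched for related threads, and I believe this is the second thread that was cited when that topic was closed. My apologies for that. As that topic’s title says, providing a way to edit the robots.txt file would solve everyone’s indexing problems :slight_smile:

OK, understood. Most of the arguments I’ve seen for excluding user profiles are mainly about SEO, duplicate content and the like, which I don’t agree with. I haven’t seen anyone cite “not wanting their profile indexed on a public forum” as one of the main reasons for excluding profiles. It is certainly a legitimate reason, but I don’t think it was part of the decision when profiles were originally removed (at least not publicly): Excluding user profiles in robots.txt (or allow edit of file)

In any case, I won’t post in this thread again. Thanks.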

The following came as an explanation and supporting use case for why this could/should be edited from our SEO Manager:

The issue here lies in the fact that this is indexed and seen as a) duplicate content b) thin content c) does not provide value and d) affects page (and domain) authority increases.

There is no Google Search Console error, warning, or penalties placed, yet (“Thin content with low or no added value.” manual penalty is the most common penalty). Utilizing other SEO-related tools, this does surface. If requested, I can absolutely provide screenshots of these reports, just let me know.

Moz Pro site crawl reports, for example, mark these Latest pages as having multiple content errors for a variety of reasons. The issues associated with these pages include (based upon our domain reports, yet seemingly relevant to others): URL Too Long, Thin Content, Missing Canonical Tag, Overly Dynamic URL, Duplicate Titles and Descriptions, Missing H1, etc.

The question now seems to be “remove or improve?”. One option is to add the noindex, nofollow meta tag values and include these pages in robots.txt. Another option is to add canonicalization to these duplicates, referencing back to the main /latest page. A third option is to keep all of these pages indexed and focus on improving and optimizing them, so the content truly does provide unique value and the issues noted above are corrected.
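For the canonicalization option, a minimal sketch might look like this. It assumes the duplicate Latest pages differ only by their query string, which may not hold for every site; the function name and example URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_link(url: str) -> str:
    """Build a canonical <link> tag that drops the query string and fragment."""
    parts = urlsplit(url)
    # Assumption: variants of a list page differ only in query/fragment.
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return f'<link rel="canonical" href="{canonical}">'


if __name__ == "__main__":
    print(canonical_link("https://forum.example.com/latest?page=2&order=views"))
```

Every variant of the list would then point search engines back to the single main /latest URL rather than being treated as a separate page.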

For reference here is information from Google around this

This statement makes zero sense to me, “latest” is the default homepage for Discourse so we absolutely do want people to land there.