Needing to edit robots.txt file - where is it?

jerdog · August 8, 2018, 12:38am

Correct me if I am wrong, but Latest is the default display but not the default link, right? This has to do with the actual /latest link

sam · August 8, 2018, 12:46am

We have every single page of latest in the index, the content is like quicksand and there is nothing in the homepage that is “site specific” and not quicksand which is a big problem:

We absolutely do not want people landing on page 2 / 3 etc… page 1 maybe, but the content on page 1 keeps on changing.

This URL for example https://meta.discourse.org/latest?no_definitions=true&no_subcategories=false&page=2 is stored in the Google index.

I am reticent to change stuff though cause I do not know how the big Google will deal with us adding “dont store in index” directives here. Also people never land on these pages anyway cause Google automatically detects they are rubbish and do not send people there.

If there is anything super positive here, I guess it would be having a wonderful “HTML off” homepage that has useful enough content that search engines would send people to the page.

For example, it would be super nice if discourse community discussions ranked meta.discourse.org first cause we had a nice front page.

A simple fix here we can make that can give us lots of mileage is nice expansion of pinned posts:

They are stable content, we can expand that:

In fact we can even expand it a bit further for crawler views. Additionally we could list all the categories on the home page as well in the crawler view… there is a bunch of stuff we can do.

Pham_Quyet_Nghi · August 8, 2018, 1:41am

Hello!
this is my file

# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
#
User-agent: *
Disallow: /auth/cas
Disallow: /auth/facebook/callback
Disallow: /auth/twitter/callback
Disallow: /auth/google/callback
Disallow: /auth/yahoo/callback
Disallow: /auth/github/callback
Disallow: /auth/cas/callback
Disallow: /assets/browser-update*.js
Disallow: /users/
Disallow: /u/
Disallow: /my/
Disallow: /badges/
Disallow: /search
Disallow: /search/
Disallow: /tags
Disallow: /tags/
Disallow: /email/
Disallow: /session
Disallow: /session/
Disallow: /admin
Disallow: /admin/
Disallow: /user-api-key
Disallow: /user-api-key/
Disallow: /*?api_key*
Disallow: /*?*api_key*
Disallow: /groups
Disallow: /groups/
Disallow: /t/*/*.rss
Disallow: /tags/*.rss
Disallow: /c/*.rss


User-agent: mauibot
Disallow: /


User-agent: bingbot
Crawl-delay: 60
Disallow: /auth/cas
Disallow: /auth/facebook/callback
Disallow: /auth/twitter/callback
Disallow: /auth/google/callback
Disallow: /auth/yahoo/callback
Disallow: /auth/github/callback
Disallow: /auth/cas/callback
Disallow: /assets/browser-update*.js
Disallow: /users/
Disallow: /u/
Disallow: /my/
Disallow: /badges/
Disallow: /search
Disallow: /search/
Disallow: /tags
Disallow: /tags/
Disallow: /email/
Disallow: /session
Disallow: /session/
Disallow: /admin
Disallow: /admin/
Disallow: /user-api-key
Disallow: /user-api-key/
Disallow: /*?api_key*
Disallow: /*?*api_key*
Disallow: /groups
Disallow: /groups/
Disallow: /t/*/*.rss
Disallow: /tags/*.rss
Disallow: /c/*.rss

I read the tutorials above but I do not understand how to fix the question “Need to edit robots.txt file - where is it?”. Looking forward to receiving help from the community

This is the content to be want to update

# See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
#
User-agent: *
Disallow: /auth/cas
Disallow: /auth/facebook/callback
Disallow: /auth/twitter/callback
Disallow: /auth/google/callback
Disallow: /auth/yahoo/callback
Disallow: /auth/github/callback
Disallow: /auth/cas/callback
Disallow: /assets/browser-update*.js
Disallow: /users/
Disallow: /u/
Disallow: /badges/
Disallow: /search
Disallow: /search/
Disallow: /tags
Disallow: /tags/

Thanks all

Stranik · August 8, 2018, 6:58am

I think you can override the file in your own plugin.

github.com

discourse/discourse/blob/main/app/views/robots_txt/index.erb

<%= @robots_info[:header] %>
<% if Discourse.base_path.present? %>
# This robots.txt file is not used. Please append the content below in the robots.txt file located at the root
<% end %>
#
<% @robots_info[:agents].each do |agent| %>
User-agent: <%= agent[:name] %>
<% agent[:disallow].each do |path| %>
Disallow: <%= path %>
<% end %>


<% end %>

<%- if SiteSetting.enable_sitemap? && !SiteSetting.login_required? %>
Sitemap: <%= request.protocol %><%= request.host_with_port %>/sitemap.xml
<% end %>

<%= server_plugin_outlet "robots_txt_index" %>

Pham_Quyet_Nghi · August 11, 2018, 2:05am

My archive directory is this

robots%20txt

how to override the file in your own plugin

Thanks

cpradio · August 11, 2018, 2:15am

You will want to read the plugin development topics and then read this
https://meta.discourse.org/t/how-to-block-all-crawlers-but-googles/62431/4?u=cpradio

Pham_Quyet_Nghi · August 11, 2018, 2:26am

I really do not want to block the google search engine that I want to change by content in the robots.txt file

Why does my website not find such a directory /discourse/app/views ?

Mittineague · August 11, 2018, 3:32am

There is no robots.txt text file per se. It is a Ruby controller
https://github.com/discourse/discourse/blob/master/app/controllers/robots_txt_controller.rb

cpradio · August 11, 2018, 3:42am

You really need to read some of the dev topics, it explains all of that and more. The plugin should be trivial, to be honest. Or you can post something in marketplace with a budget to see if someone will build it for you.

j127 · August 16, 2018, 5:53pm

If that is added, could it be made into an overridable setting? I clicked on this link in the newsletter, because getting user pages indexed is also something we need. We’re hoping to add additional information to them and eventually redirect the old (indexed) user pages to the Discourse ones.

j127 · April 13, 2019, 5:08pm

I was just noticing this problem on one of my Discourse sites. The way to block those dynamic URLs from bots while still allowing search engines to crawl /latest is this:

Disallow: /latest?

That will only block the dynamic ones, but not /latest, so search engines would still be able to see the latest content. I tested the rule in Google’s Webmaster Tools and it works.

Here’s an example of some of the dynamic URLs that are getting crawled on my site:

https://gist.githubusercontent.com/j127/d329c15dab45369b03321cad40448734/raw/300aa579b1386087b903da6aa52c52ff5d95828c/latest.txt

Is it possible to add that one line to robots.txt?

(Edit: I looked more closely at the file, and I wouldn’t use noindex there, at least on that dynamic rule. I’m pretty sure that Google has recommended not to use noindex in robots.txt though it was several years ago.)

codinghorror · July 9, 2019, 11:30pm

You can now ban or limit abusive webcrawlers via site settings which indirectly edits robots.txt but we still don’t provide arbitrary editing ability.

I think we should though … @eviltrout can you scope this for 2.4? It answers a lot of requests, many of which we don’t agree with, but my attitude on this is “it’s your funeral so go for it if you feel you must ”

Stephen · July 10, 2019, 2:24am

Can we at least mark editing the robots.txt as totally outside the scope for community support?

vinothkannans · July 10, 2019, 5:17am

FTR, anyone can easily add additional rules through a simple plugin using the “robots_txt_index” connector template. For example: app/views/connectors/robots_txt_index/sitemap.html.erb

eviltrout · July 10, 2019, 7:25pm

Here’s how I think it should work:

Add a new URL to the admin section which is not linked directory. For example /admin/customize/robots
- Show a <textarea> with the current robots.txt content.
- If they’ve not edited it before, pre-fill it with the contents based on the white/blacklist.
- When the admin mashes Save Changes, it should be saved to the database and will replace the existing contents for robots.txt for that forum.

codinghorror · July 10, 2019, 7:31pm

I am strongly opposed to this, because it gives an obscure and dangerous option top billing in the UI.

I think the route to customize robots.txt should be custom and hand entered for now. If users want it they need to search google or meta and find the path.

eviltrout · July 10, 2019, 7:44pm

That’s why I hid it behind “Advanced Edit”, but if we are obscuring the interface I can simplify it further (will edit that post.)

Osama · July 11, 2019, 8:44pm

I’ve created a PR for this:

https://github.com/discourse/discourse/pull/7884

Screenshots:

codinghorror · July 12, 2019, 12:14am

Looks good! Make sure revert button uses the correct glyph, the same one we use on revert in site settings. Also we just use the word “reset” so you can repurpose that copy rather than creating a new translation.

Also we need some warnings about the handful of site settings that modify robots.txt which will be overridden if you manually edit etc.

Osama · July 15, 2019, 6:44pm

PR was just merged:

https://github.com/discourse/discourse/commit/6515ff19e5c8e62ba3aaecb5947eaccdcbbaf0dd

If you update to latest tests-passed, you’ll be able to customize robots.txt at /admin/customize/robots. The page is not linked to from anywhere in the UI, you’ll have to copy and paste the URL manually into your browser.

Note: if you override the file, any later changes to the site settings that modify robots.txt (e.g. whitelisted crawler user agents etc.) won’t apply to the file (settings will save correctly, but changes won’t reflect on robots.txt). You can restore to the default version and the site settings will apply to the file again.

If there are overrides AND an admin views the file at /robots.txt, they’ll see a comment on the top that says there’re overrides and links to where they can modify the file or reset to the default version.

Topic		Replies	Views
Why there are lots of Disallow rule in robots.txt? Support	34	4548	December 22, 2020
Issues Google Search Console is throwing at me for wrong discourse structure (or some for wrong administration of my site) Support	18	162	December 18, 2024
Excluding user profiles in robots.txt (or allow edit of file) Feature	5	2500	May 24, 2014
Pages listed in the robots.txt are crawled and indexed by Google Support	19	3271	July 30, 2019
Google indexing issue (robots.txt) Support	9	683	May 23, 2024

Needing to edit robots.txt file - where is it?

Related topics