Need to edit robots.txt file - where is it?

jerdog · August 8, 2018, 12:38 AM

Correct me if I am wrong, but Latest is the default display, not the default link, right? This has to do with the actual /latest link.

sam · August 8, 2018, 12:46 AM

We have every single page of latest in the index. The content is like quicksand, and there is nothing on the homepage that is “site specific” and not quicksand, which is a big problem.

We absolutely do not want people landing on page 2, 3, and so on. Page 1 maybe, but the content on page 1 keeps changing.

This URL, for example, https://meta.discourse.org/latest?no_definitions=true&no_subcategories=false&page=2 is stored in the Google index.

I am reluctant to change stuff though, because I do not know how the big Google will deal with us adding “don't store in index” directives here. Also, people never land on these pages anyway, because Google automatically detects they are rubbish and does not send people there.

If there is anything super positive here, I guess it would be having a wonderful “HTML off” homepage that has useful enough content that search engines would send people to the page.

For example, it would be super nice if discourse community discussions ranked meta.discourse.org first because we had a nice front page.

A simple fix we can make here that can give us lots of mileage is a nice expansion of pinned posts.

They are stable content, so we can expand that.

In fact we can even expand it a bit further for crawler views. Additionally, we could list all the categories on the home page in the crawler view as well. There is a bunch of stuff we can do.

Pham_Quyet_Nghi · August 8, 2018, 1:41 AM

Hello!
This is my file:

# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
#
User-agent: *
Disallow: /auth/cas
Disallow: /auth/facebook/callback
Disallow: /auth/twitter/callback
Disallow: /auth/google/callback
Disallow: /auth/yahoo/callback
Disallow: /auth/github/callback
Disallow: /auth/cas/callback
Disallow: /assets/browser-update*.js
Disallow: /users/
Disallow: /u/
Disallow: /my/
Disallow: /badges/
Disallow: /search
Disallow: /search/
Disallow: /tags
Disallow: /tags/
Disallow: /email/
Disallow: /session
Disallow: /session/
Disallow: /admin
Disallow: /admin/
Disallow: /user-api-key
Disallow: /user-api-key/
Disallow: /*?api_key*
Disallow: /*?*api_key*
Disallow: /groups
Disallow: /groups/
Disallow: /t/*/*.rss
Disallow: /tags/*.rss
Disallow: /c/*.rss


User-agent: mauibot
Disallow: /


User-agent: bingbot
Crawl-delay: 60
Disallow: /auth/cas
Disallow: /auth/facebook/callback
Disallow: /auth/twitter/callback
Disallow: /auth/google/callback
Disallow: /auth/yahoo/callback
Disallow: /auth/github/callback
Disallow: /auth/cas/callback
Disallow: /assets/browser-update*.js
Disallow: /users/
Disallow: /u/
Disallow: /my/
Disallow: /badges/
Disallow: /search
Disallow: /search/
Disallow: /tags
Disallow: /tags/
Disallow: /email/
Disallow: /session
Disallow: /session/
Disallow: /admin
Disallow: /admin/
Disallow: /user-api-key
Disallow: /user-api-key/
Disallow: /*?api_key*
Disallow: /*?*api_key*
Disallow: /groups
Disallow: /groups/
Disallow: /t/*/*.rss
Disallow: /tags/*.rss
Disallow: /c/*.rss

I read the tutorials above, but I still do not understand how to solve the question “Need to edit robots.txt file - where is it?”. I am looking forward to receiving help from the community.

This is the content I want to update it to:

# See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
#
User-agent: *
Disallow: /auth/cas
Disallow: /auth/facebook/callback
Disallow: /auth/twitter/callback
Disallow: /auth/google/callback
Disallow: /auth/yahoo/callback
Disallow: /auth/github/callback
Disallow: /auth/cas/callback
Disallow: /assets/browser-update*.js
Disallow: /users/
Disallow: /u/
Disallow: /badges/
Disallow: /search
Disallow: /search/
Disallow: /tags
Disallow: /tags/

Thanks all

Stranik · August 8, 2018, 6:58 AM

I think you can override the file in your own plugin.

github.com/discourse/discourse

app/views/robots_txt/index.erb

main

<%= @robots_info[:header] %>
<% if Discourse.base_path.present? %>
# This robots.txt file is not used. Please append the content below in the robots.txt file located at the root
<% end %>
#
<% @robots_info[:agents].each do |agent| %>
User-agent: <%= agent[:name] %>
<% agent[:disallow].each do |path| %>
Disallow: <%= path %>
<% end %>


<% end %>

<%- if SiteSetting.enable_sitemap? && !SiteSetting.login_required? %>
Sitemap: <%= request.protocol %><%= request.host_with_port %>/sitemap.xml
<% end %>

<%= server_plugin_outlet "robots_txt_index" %>
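
To see what the loop in that template produces, here is a minimal standalone sketch using Ruby's stdlib ERB. It is not Discourse's code: the template is a simplified copy (no Discourse.base_path check, no Sitemap line, no plugin outlet), the trim markers are added here for compact output, and the agents and paths in SAMPLE_ROBOTS_INFO are made up for illustration.

```ruby
require "erb"

# Simplified copy of the agent loop in app/views/robots_txt/index.erb.
# Assumption: the trim markers (<%- ... -%>) are added for this sketch only.
ROBOTS_TEMPLATE = <<~TEMPLATE
  <%= robots_info[:header] %>
  <%- robots_info[:agents].each do |agent| -%>
  User-agent: <%= agent[:name] %>
  <%- agent[:disallow].each do |path| -%>
  Disallow: <%= path %>
  <%- end -%>

  <%- end -%>
TEMPLATE

# Hypothetical sample data in the shape the template reads from @robots_info.
SAMPLE_ROBOTS_INFO = {
  header: "# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file",
  agents: [
    { name: "*", disallow: ["/admin/", "/auth/"] },
    { name: "mauibot", disallow: ["/"] },
  ],
}.freeze

# Render the template with the given data and return the robots.txt text.
def render_robots_txt(robots_info)
  ERB.new(ROBOTS_TEMPLATE, trim_mode: "-").result_with_hash(robots_info: robots_info)
end

puts render_robots_txt(SAMPLE_ROBOTS_INFO)
```

Each entry in agents becomes one User-agent group followed by its Disallow lines, which is the same shape as the file posted earlier in this topic.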

Pham_Quyet_Nghi · August 11, 2018, 2:05 AM

This is my directory listing.

How do I override the file in my own plugin?

Thanks

cpradio · August 11, 2018, 2:15 AM

You will want to read the plugin development topics and then read this:
https://meta.discourse.org/t/how-to-block-all-crawlers-but-googles/62431/4?u=cpradio

Pham_Quyet_Nghi · August 11, 2018, 2:26 AM

I really do not want to block the Google search engine. I only want to change the content of the robots.txt file.

Why can't I find the directory /discourse/app/views on my website?

Mittineague · August 11, 2018, 3:32 AM

There is no robots.txt text file per se. It is a Ruby controller:

github.com/discourse/discourse

app/controllers/robots_txt_controller.rb

main

# frozen_string_literal: true

class RobotsTxtController < ApplicationController
  layout false
  skip_before_action :preload_json,
                     :check_xhr,
                     :redirect_to_login_if_required,
                     :redirect_to_profile_if_required

  OVERRIDDEN_HEADER = "# This robots.txt file has been customized at /admin/customize/robots\n"

  # NOTE: order is important!
  DISALLOWED_PATHS = %w[
    /admin/
    /auth/
    /assets/js/browser-update*.js
    /email/
    /session
    /user-api-key
    /*?api_key*

This file has been truncated.

cpradio · August 11, 2018, 3:42 AM

You really need to read some of the Development topics, it explains all of that and more. The plugin should be trivial, to be honest. Or you can post something in Marketplace with a budget to see if someone will build it for you.

j127 · August 16, 2018, 5:53 PM

If that is added, could it be made into an overridable setting? I clicked on this link in the newsletter, because getting user pages indexed is also something we need. We’re hoping to add additional information to them and eventually redirect the old (indexed) user pages to the Discourse ones.

j127 · April 13, 2019, 5:08 PM

I was just noticing this problem on one of my Discourse sites. The way to block those dynamic URLs from bots while still allowing search engines to crawl /latest is this:

Disallow: /latest?

That will only block the dynamic ones, but not /latest, so search engines would still be able to see the latest content. I tested the rule in Google’s Webmaster Tools and it works.

Here’s an example of some of the dynamic URLs that are getting crawled on my site:

https://gist.githubusercontent.com/j127/d329c15dab45369b03321cad40448734/raw/300aa579b1386087b903da6aa52c52ff5d95828c/latest.txt

Is it possible to add that one line to robots.txt?

(Edit: I looked more closely at the file, and I wouldn’t use noindex there, at least on that dynamic rule. I’m pretty sure that Google has recommended not to use noindex in robots.txt though it was several years ago.)
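
Why Disallow: /latest? blocks only the dynamic URLs can be checked with a small matcher. This is a simplified sketch, not a full robots.txt parser: it treats a rule as a prefix of the URL path plus query string, supports the * wildcard and a trailing $ the way major crawlers document them, and ignores Allow rules and longest-match precedence. The function names are made up for illustration.

```ruby
# Convert a robots.txt path pattern into a Regexp.
# "*" matches any run of characters; a trailing "$" anchors the end of the URL.
# Without "$", the rule is a prefix match.
def rule_to_regexp(pattern)
  anchored = pattern.end_with?("$")
  body = anchored ? pattern[0...-1] : pattern
  source = body.split("*", -1).map { |part| Regexp.escape(part) }.join(".*")
  Regexp.new("\\A" + source + (anchored ? "\\z" : ""))
end

# True if any non-empty Disallow rule matches the path (including its query string).
# An empty Disallow value means "nothing is disallowed", so it is skipped.
def disallowed?(url_path, disallow_rules)
  disallow_rules.reject(&:empty?).any? do |rule|
    rule_to_regexp(rule).match?(url_path)
  end
end

rules = ["/latest?"]
puts disallowed?("/latest", rules)
puts disallowed?("/latest?no_definitions=true&no_subcategories=false&page=2", rules)
```

The plain /latest path contains no ?, so the rule's prefix never matches it, while every URL that begins with /latest? does match.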

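codinghorror · July 9, 2019, 11:30 PM

We now have site settings to ban or throttle badly behaved web crawlers. This indirectly edits robots.txt, but we still do not offer arbitrary editing.

I think we should, though. @eviltrout, could this be put in scope for 2.4? It would address a lot of requests. I don't agree with many of them, but my stance is “it's your responsibility, so go ahead if you really must.”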
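
As a rough illustration of how crawler settings can translate into robots.txt groups (compare the mauibot and bingbot groups in the file posted earlier in this topic), consider the sketch below. The function and its parameter names are hypothetical; they are not Discourse's actual setting names or implementation.

```ruby
# Build robots.txt text from hypothetical crawler settings.
#   default_disallow: paths disallowed for every crawler
#   blocked_agents:   crawlers banned from the whole site
#   slowed_agents:    crawler name => Crawl-delay in seconds
def build_robots_txt(default_disallow:, blocked_agents: [], slowed_agents: {})
  disallow_lines = default_disallow.map { |path| "Disallow: #{path}" }

  # The catch-all group comes first.
  groups = [["User-agent: *", *disallow_lines]]

  # A banned crawler gets a group that disallows everything.
  blocked_agents.each do |agent|
    groups << ["User-agent: #{agent}", "Disallow: /"]
  end

  # A throttled crawler gets a Crawl-delay plus the default rules repeated.
  slowed_agents.each do |agent, delay|
    groups << ["User-agent: #{agent}", "Crawl-delay: #{delay}", *disallow_lines]
  end

  groups.map { |lines| lines.join("\n") }.join("\n\n") + "\n"
end

puts build_robots_txt(
  default_disallow: ["/admin/", "/search"],
  blocked_agents: ["mauibot"],
  slowed_agents: { "bingbot" => 60 },
)
```

The default rules are repeated inside the throttled crawler's group because a crawler that finds a group naming it follows that group instead of the catch-all one.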

Stephen · July 10, 2019, 2:24 AM

Couldn't editing robots.txt simply be excluded entirely, as something outside the scope of community support?

vinothkannans · July 10, 2019, 5:17 AM

FYI, anyone can easily add extra rules with a simple plugin by using the robots_txt_index connector template. For example: app/views/connectors/robots_txt_index/sitemap.html.erb

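eviltrout · July 10, 2019, 7:25 PM

Here is what I am thinking.

Add a new URL to the admin section that is not linked directly, e.g. /admin/customize/robots
- It displays a <textarea> showing the current contents of robots.txt.
- If it has not been edited before, it is prefilled with content based on the whitelist/blacklist.
- When the admin clicks Save Changes, the text is saved to the database and replaces the existing robots.txt content for that forum.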

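codinghorror · July 10, 2019, 7:31 PM

I strongly disagree with this, because it surfaces an obscure and dangerous option at the top level of the UI.
I think the path for customizing robots.txt should, for now, be something you have to type in by hand. Users who want it will need to search Google or meta to find the path.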

eviltrout · July 10, 2019, 7:44 PM

That is why I hid it behind “advanced edit”, but if it makes the interface too complicated, we can simplify it further (I will edit that post).

Osama · July 11, 2019, 8:44 PM

I have created a PR for this:

github.com/discourse/discourse

FEATURE: Allow customization of robots.txt (#7884)

master ← OsamaSayegh:customize-robots-txt

merged 05:47PM - 15 Jul 19 UTC

OsamaSayegh

+282 -7

This allows admins to customize/override the content of the robots.txt file at …/admin/customize/robots. That page is not linked to anywhere in the UI -- admins have to manually type the URL to access that page.

Meta topic: https://meta.discourse.org/t/needing-to-edit-robots-txt-file-where-is-it/93879?u=osama

Screenshots: ![image](https://user-images.githubusercontent.com/17474474/61083151-b5174800-a433-11e9-890e-283221046b2b.png) ![image](https://user-images.githubusercontent.com/17474474/61083312-16d7b200-a434-11e9-9cae-dc3a9982c049.png)

@eviltrout does it make sense to prepend a comment to robots.txt that says something along the lines of "this robots.txt file has been customized at /admin/customize/robots" **if** the file is customized? It might help with figuring out why certain things are in the file and how to remove/change them?

Screenshots:

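codinghorror · July 12, 2019, 12:14 AM

Looks good! For the “revert” button, please use the same glyph we use for “revert” in site settings. Also, we only ever use the word “reset”, so please reuse the existing text instead of creating a new translation.

In addition, a few site settings modify robots.txt, and a manual edit overrides them, so we need to warn about that.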

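Osama · July 15, 2019, 6:44 PM

The PR has been merged.

Once you update to the latest tests-passed version, you will be able to customize robots.txt at /admin/customize/robots. This page is not linked from anywhere in the UI, so you have to copy the URL and paste it into your browser manually.

Note: if you override the file, later changes to site settings (such as whitelisted crawler user agents) will no longer be reflected in robots.txt. The settings are saved correctly, but the changes do not appear in robots.txt. Restoring the default version makes the site settings apply to the file again.

If an override exists and an admin views the file at /robots.txt, a comment at the top of the file notes that it has been overridden and links to the place where the file can be edited or reset to the default version.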
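
The override behaviour described above can be modelled in a few lines. This is only a sketch under stated assumptions: RobotsTxtStore is a hypothetical in-memory stand-in for the database-backed override, and the generator block stands in for whatever builds the default file from site settings. Only the header text is taken from the controller quoted earlier in this topic.

```ruby
# Header text as it appears in the controller quoted earlier in this topic.
OVERRIDDEN_HEADER = "# This robots.txt file has been customized at /admin/customize/robots\n"

# Hypothetical in-memory stand-in for the stored override.
class RobotsTxtStore
  # The block builds the default file from the current settings on every call.
  def initialize(&default_generator)
    @default_generator = default_generator
    @override = nil
  end

  def save_override(text)
    @override = text
  end

  def reset_to_default
    @override = nil
  end

  # With an override in place the generator is never consulted,
  # and admins additionally see the header comment.
  def render(admin: false)
    return @default_generator.call if @override.nil?
    admin ? OVERRIDDEN_HEADER + @override : @override
  end
end

blocked = ["mauibot"]
store = RobotsTxtStore.new do
  blocked.map { |agent| "User-agent: #{agent}\nDisallow: /\n" }.join("\n")
end

store.save_override("User-agent: *\nDisallow: /latest?\n")
blocked << "examplebot"   # a later settings change...
puts store.render         # ...does not show up while the override exists
store.reset_to_default
puts store.render         # after a reset the settings apply again
```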
