Why are semrushbot and ahrefsbot blocked by default?

Jamie_Liu1 · July 14, 2020, 8:57 AM

I checked the Coverage report in Google Search Console and found that many forum pages are blocked by robots.txt. When I then looked at robots.txt, I found that semrushbot and ahrefsbot are blocked by default.

These are widely used SEO tools, so why are these bots blocked?

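neounix · July 14, 2020, 9:03 AM

Because those bots are "resource-hogging bot pigs": the value they provide is very low compared to the amount of site resources they consume.

Of course, you can customize the Discourse robots.txt file to allow them if you need to. However, our site was blocking these bots long before Discourse was released, and we still block them today.

Note (edited):

I forgot to add that many of these "resource-hogging bot pigs" do not respect robots.txt and have to be blocked at the HTTP User-Agent level. In general, we block these "rude resource-hogging bot pigs" at the reverse proxy level using mod_rewrite (one of the many benefits of running behind a reverse proxy).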
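
To illustrate what blocking at the HTTP User-Agent level means, here is a minimal sketch in Ruby. It is not the mod_rewrite setup described above: it swaps in a Rack-style middleware to show the same idea. The class name `UserAgentBlocker`, the blocklist, and the 403 response are all assumptions made for the example.

```ruby
# Hypothetical Rack-style middleware: refuses requests whose User-Agent
# header matches a blocklist. All names here are illustrative only.
class UserAgentBlocker
  # Example list only; adjust to the bots you actually want to refuse.
  BLOCKED_AGENTS = %w[semrushbot ahrefsbot].freeze

  def initialize(app, blocked: BLOCKED_AGENTS)
    @app = app
    # Escape each name and match it case-insensitively anywhere in the header.
    # Regexp.union of an empty list yields a pattern that never matches.
    @pattern = Regexp.union(
      blocked.map { |name| Regexp.new(Regexp.escape(name), Regexp::IGNORECASE) }
    )
  end

  def call(env)
    user_agent = env["HTTP_USER_AGENT"].to_s
    if user_agent.match?(@pattern)
      # Refuse the request before it reaches the application.
      [403, { "content-type" => "text/plain" }, ["Forbidden\n"]]
    else
      @app.call(env)
    end
  end
end

# Usage with a stand-in application:
app = ->(_env) { [200, { "content-type" => "text/plain" }, ["ok\n"]] }
stack = UserAgentBlocker.new(app)
status, = stack.call("HTTP_USER_AGENT" => "Mozilla/5.0 (compatible; SemrushBot/7~bl)")
```

Unlike a robots.txt rule, a check like this does not depend on the bot cooperating, which is why it works against crawlers that ignore robots.txt.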

Jamie_Liu1 · July 14, 2020, 9:29 AM

Thanks for the information!

I also found another issue and would appreciate your thoughts on it.

Discourse blocks user pages by default, but the Coverage report in Google Search Console still shows some user pages as indexed. That is a problem, because none of these pages should be indexed.

Thank you!

osioke · July 14, 2020, 12:35 PM

This was recently fixed in the following commit.

Please update Discourse and check again.

Jamie_Liu1 · July 15, 2020, 2:14 AM

@osioke Thanks for your reply! I think our installed version already includes that feature, since I noticed the fix was committed in January.

Could you confirm whether we need to upgrade to the latest version to get it?

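osioke · July 15, 2020, 7:03 AM

Personally, I think updating is fine, although the fix should already be included in your installed version. Unless you have some other reason not to update, I recommend updating and then re-validating.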

codinghorror · July 15, 2020, 9:41 PM

Because they perform poorly: they significantly increase server load without providing any clear benefit, and our customers have pageview limits on each plan.

Jamie_Liu1 · July 16, 2020, 2:13 AM

Great. We are updating now, and I hope the problem is resolved afterwards. I will follow up later and let you know how it goes. Thank you!

trying2survive · December 2, 2020, 3:30 PM

Just to confirm: is there really no way to unblock semrushbot and seo spider? We need them for SEO audits. I tried removing both from /admin/customize/robots (and also tried Allow:), but Screaming Frog then returned 429 errors. Is the 429 error a separate problem? Any guidance would be appreciated.

Johani · December 2, 2020, 4:34 PM

A 429 error means the crawler is being rate limited. Discourse has throttling enabled by default to prevent abuse. See here for details.
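
As a rough illustration of how throttling leads to a 429 response, here is a minimal fixed-window rate limiter in Ruby. This is not Discourse's throttling code: the class name `FixedWindowLimiter`, the limits, and the client identifiers are all assumptions made for the example.

```ruby
# Hypothetical fixed-window rate limiter, for illustration only.
# Old windows are never purged here; a real limiter would expire them.
class FixedWindowLimiter
  def initialize(max_requests:, window_seconds:)
    @max_requests = max_requests
    @window_seconds = window_seconds
    @counters = Hash.new(0) # [client, window index] => request count
  end

  # Returns the HTTP status a server might send for this request.
  def status_for(client_id, now: Time.now)
    window = now.to_i / @window_seconds
    key = [client_id, window]
    @counters[key] += 1
    # Requests beyond the per-window budget get 429 Too Many Requests.
    @counters[key] > @max_requests ? 429 : 200
  end
end
```

A crawler that requests pages faster than the budget allows keeps receiving 429 until the next window starts, so slowing the crawl rate in the audit tool is usually the first thing to try.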

neounix · December 3, 2020, 9:35 AM

Have you tried this (replacing the container name with your own)?

Note: this can also be configured in the admin UI.

# docker exec -it socket-only bash
root@socket-only:/# rails c
[1] pry(main)> SiteSetting.blocked_crawler_user_agents
=> "mauibot|semrushbot|ahrefsbot|blexbot|seo spider"
[2] pry(main)> SiteSetting.blocked_crawler_user_agents = ""
=> ""
[3] pry(main)> SiteSetting.blocked_crawler_user_agents
=> ""
[4] pry(main)>

Reference:

github.com/discourse/discourse

config/site_settings.yml

d1d87b6fa

# Available options:
#
# default            - The default value of the setting. For upload site settings, use the id of the upload seeded in db/fixtures/010_uploads.rb.
# client             - Set to true if the javascript should have access to this setting's value.
# refresh            - Set to true if clients should refresh when the setting is changed.
# min                - For a string setting, the minimum length. For an integer setting, the minimum value.
# max                - For a string setting, the maximum length. For an integer setting, the maximum value.
# regex              - A regex that the value must match.
# validator          - The name of the class that will be use to validate the value of the setting.
# allow_any          - For choice settings allow items not specified in the choice list (default true)
# secret             - Set to true if input type should be password and value needs to be scrubbed from logs (default false).
# enum               - The setting has a fixed set of allowed values, and only one can be chosen.
#                      Set to the class name that defines the set.
# locale_default     - A hash which overrides according to `SiteSetting.default_locale`.
#                      The key should be as the same as possible value of default_locale.
#
#
# type: email    - Must be a valid email address.
# type: username - Must match the username of an existing user.
# type: list     - A list of values, chosen from a set of valid values defined in the choices option.


Reference:

  def self.allow_crawler?(user_agent)
    return true if SiteSetting.allowed_crawler_user_agents.blank? &&
      SiteSetting.blocked_crawler_user_agents.blank?
...
...

github.com/discourse/discourse

lib/crawler_detection.rb

e0d923225

# frozen_string_literal: true

module CrawlerDetection
  WAYBACK_MACHINE_URL = "archive.org"

  def self.to_matcher(string, type: nil)
    escaped = string.split('|').map { |agent| Regexp.escape(agent) }.join('|')

    if type == :real && Rails.env == "test"
      # we need this bypass so we properly render views
      escaped << "|Rails Testing"
    end

    Regexp.new(escaped, Regexp::IGNORECASE)
  end

  def self.crawler?(user_agent, via_header = nil)
    return true if user_agent.nil? || user_agent&.include?(WAYBACK_MACHINE_URL) || via_header&.include?(WAYBACK_MACHINE_URL)

    # this is done to avoid regenerating regexes


As the code shows, if you set these two site settings to blank, no blocking takes place.

SiteSetting.allowed_crawler_user_agents
SiteSetting.blocked_crawler_user_agents

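However, I strongly recommend not changing them, because the bots that Discourse core blocks by default do not respect robots.txt. That said, it is your site, so configure it however you like. There are clear reasons why they are blocked in core.

Even so, Discourse does provide the option to "unblock" these bots through the site settings in the UI.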
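
To make the blank-setting behavior concrete, here is a standalone sketch in Ruby. `to_matcher` mirrors the escaping shown in the quoted excerpt (without the test-environment branch), while `blocked_by_list?` is a hypothetical helper written for this example and not a Discourse method. The User-Agent strings are sample values.

```ruby
# Builds a case-insensitive pattern from a pipe-delimited list,
# following the escaping shown in the quoted to_matcher excerpt.
def to_matcher(list)
  escaped = list.split("|").map { |agent| Regexp.escape(agent) }.join("|")
  Regexp.new(escaped, Regexp::IGNORECASE)
end

# Hypothetical helper: is this User-Agent on the blocked list?
def blocked_by_list?(user_agent, blocked_list)
  # The blank check has to come first: an empty pattern matches any string.
  return false if blocked_list.nil? || blocked_list.strip.empty?

  user_agent.to_s.match?(to_matcher(blocked_list))
end

default_list = "mauibot|semrushbot|ahrefsbot|blexbot|seo spider"
blocked_by_list?("Screaming Frog SEO Spider/14.0", default_list)      # => true
blocked_by_list?("Mozilla/5.0 (compatible; SemrushBot/7~bl)", "")     # => false
```

The early return is the important part: without it, an empty setting would compile to an empty regular expression and match every User-Agent.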
