Why are semrushbot and ahrefsbot blocked by default?

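Jamie_Liu1 · July 14, 2020 08:57

I checked the Coverage report in Google Search Console and found that many of our forum's pages are blocked by robots.txt. So I looked at the robots.txt file and found that semrushbot and ahrefsbot are blocked by default:

I know these two are widely used SEO tools, so why block their crawlers?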

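neounix · July 14, 2020 09:03

Because these crawlers are "resource-hogging rogue bots": they consume a lot of server resources and bring very little value to the site.

Of course, you can customize the Discourse robots.txt file and allow them if you need to; but we were blocking these crawlers on our own sites long before Discourse was released, and we have kept them blocked ever since.

Note (edited):

I forgot to mention that many of these "resource-hogging rogue bots" do not obey robots.txt, so they have to be blocked at the HTTP User-Agent level. Generally, we block these misbehaving bots at the reverse proxy using mod_rewrite (which, by the way, is one of the many benefits of running behind a reverse proxy).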
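
The mod_rewrite rules themselves are not shown in the thread. As a rough illustration of the same idea (reject a request based on its User-Agent before it reaches the application), here is a minimal Rack-style middleware sketch in Ruby. The class name `BlockBadBots` and the two-bot pattern are assumptions made for this example; this is not how neounix's proxy or Discourse itself is configured.

```ruby
# Rack-style middleware sketch: refuse requests whose User-Agent matches a
# blocklist. The class name and the pattern are illustrative assumptions.
class BlockBadBots
  # Case-insensitive match on the two bot names discussed in this thread.
  BLOCKED = /semrushbot|ahrefsbot/i

  def initialize(app)
    @app = app
  end

  def call(env)
    # Rack exposes the User-Agent request header under this key.
    user_agent = env["HTTP_USER_AGENT"].to_s

    if user_agent.match?(BLOCKED)
      # Answer directly so the request never reaches the wrapped application.
      return [403, { "content-type" => "text/plain" }, ["Forbidden\n"]]
    end

    @app.call(env)
  end
end
```

Unlike a robots.txt rule, a check like this does not depend on the bot choosing to cooperate, which is the point neounix makes about filtering at the User-Agent level.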

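Jamie_Liu1 · July 14, 2020 09:29

Thank you very much for the information!

I found another issue; perhaps you can share your insight on it as well.

I know Discourse blocks user pages by default, but in my Google Search Console Coverage report some user pages are still indexed. Google flags this as a problem, since none of these pages should be indexed:

Thanks!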

osioke · July 14, 2020 12:35

This was recently fixed by the following commit:

Please update your Discourse and validate again.

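Jamie_Liu1 · July 15, 2020 02:14

@osioke Thanks for your reply! I think the version we have installed should already include it? I noticed that the fix was committed in January.

Could you confirm whether I need to upgrade to the latest version to get it?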

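osioke · July 15, 2020 07:03

In my view, updating won't hurt, but the fix should indeed be included in the version you have installed. I suggest you try updating and validating again, unless you have other reasons not to update.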

codinghorror · July 15, 2020 21:41

Because they are terrible? They put a lot of load on the server with no discernible benefit, and our customer plans do have pageview limits.

Jamie_Liu1 · July 16, 2020 02:13

Sounds good. We are updating now. Hopefully everything works after the update. I'll follow up and keep you posted. Thanks!

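trying2survive · December 2, 2020 15:30

To clarify, is there no way to unblock Semrushbot and SEO Spider? We need them for SEO audits. I tried removing both in /admin/customize/robots (and also tried adding Allow: directives), but I still get 429 errors in Screaming Frog. Is the 429 error a separate issue? Any insight would be much appreciated.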

Johani · December 2, 2020 16:34

A 429 error means these crawlers are being rate limited. Discourse has some throttling enabled by default to prevent abuse. You can read more about it here.

neounix · December 3, 2020 09:35

Have you tried this (replacing the container name with your own)?

Note: you can also configure this in the Admin UI:

# docker exec -it socket-only bash
root@socket-only:/# rails c
[1] pry(main)> SiteSetting.blocked_crawler_user_agents
=> "mauibot|semrushbot|ahrefsbot|blexbot|seo spider"
[2] pry(main)> SiteSetting.blocked_crawler_user_agents = ""
=> ""
[3] pry(main)> SiteSetting.blocked_crawler_user_agents
=> ""
[4] pry(main)>
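
The session above clears the whole blocklist. If only Semrushbot and SEO Spider need to be unblocked, one option is to remove just those entries from the pipe-delimited value. A minimal sketch, assuming the setting keeps the `a|b|c` format shown above; `remove_agents` is a hypothetical helper written for this example, not part of Discourse:

```ruby
# Hypothetical helper: drop selected agents from a pipe-delimited setting value.
# Comparison is case-insensitive and ignores surrounding whitespace.
def remove_agents(setting_value, agents_to_remove)
  unwanted = agents_to_remove.map { |agent| agent.strip.downcase }

  setting_value
    .split("|")
    .reject { |agent| unwanted.include?(agent.strip.downcase) }
    .join("|")
end

# Default value as printed in the console session above.
current = "mauibot|semrushbot|ahrefsbot|blexbot|seo spider"
puts remove_agents(current, ["semrushbot", "seo spider"])
# prints "mauibot|ahrefsbot|blexbot"

# In a Discourse Rails console, the result could then be assigned back:
#   SiteSetting.blocked_crawler_user_agents =
#     remove_agents(SiteSetting.blocked_crawler_user_agents, ["semrushbot", "seo spider"])
```

This keeps the other default entries blocked while letting the two audit tools through.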

See also:

github.com/discourse/discourse

config/site_settings.yml

d1d87b6fa

# Available options:
#
# default            - The default value of the setting. For upload site settings, use the id of the upload seeded in db/fixtures/010_uploads.rb.
# client             - Set to true if the javascript should have access to this setting's value.
# refresh            - Set to true if clients should refresh when the setting is changed.
# min                - For a string setting, the minimum length. For an integer setting, the minimum value.
# max                - For a string setting, the maximum length. For an integer setting, the maximum value.
# regex              - A regex that the value must match.
# validator          - The name of the class that will be use to validate the value of the setting.
# allow_any          - For choice settings allow items not specified in the choice list (default true)
# secret             - Set to true if input type should be password and value needs to be scrubbed from logs (default false).
# enum               - The setting has a fixed set of allowed values, and only one can be chosen.
#                      Set to the class name that defines the set.
# locale_default     - A hash which overrides according to `SiteSetting.default_locale`.
#                      The key should be as the same as possible value of default_locale.
#
#
# type: email    - Must be a valid email address.
# type: username - Must match the username of an existing user.
# type: list     - A list of values, chosen from a set of valid values defined in the choices option.

This file has been truncated.

See also:

  def self.allow_crawler?(user_agent)
    return true if SiteSetting.allowed_crawler_user_agents.blank? &&
      SiteSetting.blocked_crawler_user_agents.blank?
...
...

github.com/discourse/discourse

lib/crawler_detection.rb

e0d923225

# frozen_string_literal: true

module CrawlerDetection
  WAYBACK_MACHINE_URL = "archive.org"

  def self.to_matcher(string, type: nil)
    escaped = string.split('|').map { |agent| Regexp.escape(agent) }.join('|')

    if type == :real && Rails.env == "test"
      # we need this bypass so we properly render views
      escaped << "|Rails Testing"
    end

    Regexp.new(escaped, Regexp::IGNORECASE)
  end

  def self.crawler?(user_agent, via_header = nil)
    return true if user_agent.nil? || user_agent&.include?(WAYBACK_MACHINE_URL) || via_header&.include?(WAYBACK_MACHINE_URL)

    # this is done to avoid regenerating regexes

This file has been truncated.

As the code shows, if you set these two site settings to blank, no blocking takes place:

SiteSetting.allowed_crawler_user_agents
SiteSetting.blocked_crawler_user_agents
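
To make the matching concrete, here is a standalone sketch built on the `to_matcher` method quoted above, with the Rails test bypass left out. `blocked_by_list?` is a hypothetical helper for illustration only; it is not the elided body of `allow_crawler?`, and the User-Agent strings are just examples.

```ruby
# Standalone adaptation of the quoted to_matcher (Rails-specific branch omitted):
# a pipe-delimited setting becomes one case-insensitive regex of literal names.
def to_matcher(string)
  escaped = string.split("|").map { |agent| Regexp.escape(agent) }.join("|")
  Regexp.new(escaped, Regexp::IGNORECASE)
end

# Hypothetical helper, NOT Discourse's allow_crawler?.
# The blank check matters: an empty setting would otherwise compile to an
# empty regex, which matches every User-Agent.
def blocked_by_list?(user_agent, blocked_setting)
  return false if blocked_setting.to_s.strip.empty?

  user_agent.to_s.match?(to_matcher(blocked_setting))
end

default_list = "mauibot|semrushbot|ahrefsbot|blexbot|seo spider"

p blocked_by_list?("Mozilla/5.0 (compatible; SemrushBot/7~bl)", default_list) # true
p blocked_by_list?("Screaming Frog SEO Spider/14.0", default_list)            # true
p blocked_by_list?("Mozilla/5.0 (X11; Linux x86_64) Firefox/78.0", default_list) # false
p blocked_by_list?("Mozilla/5.0 (compatible; SemrushBot/7~bl)", "")           # false
```

Because the regex matches anywhere in the header and ignores case, the `seo spider` entry would also catch Screaming Frog's "SEO Spider" User-Agent, which fits the 429s reported earlier in the thread.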

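I recommend not changing these settings, because the crawlers that Discourse core blocks by default do not obey robots.txt; but it's your site, and you can do as you wish. They are blocked by default in core for good reason.

That said, Discourse does provide the option to "unblock" these crawlers through site settings in the UI.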
