I was checking the Google Search Console coverage report and found that lots of our forum pages are blocked by robots.txt. So I went ahead and checked robots.txt and found that semrushbot and ahrefsbot are blocked by default:
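The relevant entries look roughly like this (an abridged excerpt; the exact list depends on the site's settings):

```text
User-agent: semrushbot
Disallow: /

User-agent: ahrefsbot
Disallow: /
```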
I know these are two widely used SEO tools, so why block their bots?
Because those bots are “resource sucking bot hogs” which provide very little value to sites compared to the amount of resources they consume.
Of course, you can customize the Discourse robots.txt file and permit them if you wish; but we were blocking these bots on our sites long before Discourse was released, and we keep them blocked.
Note (Edited):
I forgot to mention that many of these “resource sucking bot hogs” do not respect robots.txt and must be blocked at the HTTP User-Agent level. We block these “disrespectful resource sucking bot hogs” with mod_rewrite at the reverse proxy level, generally speaking (one of the many good reasons to run behind a reverse proxy, BTW).
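For example, with Apache and mod_rewrite in front of Discourse, a rule along these lines does the job (a minimal sketch; the agent list is illustrative, not our exact list):

```apache
# Reject requests whose User-Agent matches known crawler names (case-insensitive).
# This runs at the reverse proxy, so blocked requests never reach Discourse.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (semrushbot|ahrefsbot|mj12bot|blexbot) [NC]
RewriteRule .* - [F,L]
```

Nginx can do the equivalent with an `if ($http_user_agent ~* ...) { return 403; }` block.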
I found another issue and maybe you can share your insight on it as well.
I know Discourse blocks user pages by default, but in my Google Search Console coverage report some user pages are still indexed, which Google flags as an issue because these pages should not be indexed:
It doesn’t hurt to update IMO, but yes, that fix should be in your installed version. I would try updating and reverifying unless you don’t want to update for some other reason.
Just to clarify, is there no way to unblock SemrushBot and the SEO Spider? We need them for an SEO audit. I tried removing both from /admin/customize/robots (and also tried Allow: ), but we get a 429 error in Screaming Frog. Or is the 429 a separate issue? Your insights are highly appreciated.
429 errors mean those crawlers are being rate limited rather than blocked by robots.txt. Discourse enables some throttling by default to prevent abuse. You can read more about this here.
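If the 429s come from the container-level limits, they can usually be relaxed in the `env` section of app.yml for the duration of the audit (a sketch, assuming a standard Docker install that includes the web.ratelimited template; verify the variable names against your own templates before rebuilding):

```yaml
env:
  ## Illustrative values only; defaults and names can differ between versions.
  DISCOURSE_MAX_REQS_PER_IP_MODE: warn            # log instead of returning 429
  DISCOURSE_MAX_REQS_PER_IP_PER_MINUTE: 400
  DISCOURSE_MAX_REQS_PER_IP_PER_10_SECONDS: 100
```

Rebuild the container after editing app.yml, and revert the change once the audit is done.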
```ruby
def self.allow_crawler?(user_agent)
  return true if SiteSetting.allowed_crawler_user_agents.blank? &&
    SiteSetting.blocked_crawler_user_agents.blank?

  # ...
  # ...
end
```
You can see from the code that if you set these two site settings to “blank” then there will be no blocking:
SiteSetting.allowed_crawler_user_agents
SiteSetting.blocked_crawler_user_agents
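If you want to confirm what your own instance is doing, you can check from the rails console (a sketch; the setting value and user agent string shown are illustrative):

```ruby
# ./launcher enter app, then `rails c`
SiteSetting.blocked_crawler_user_agents
# => "mauibot|semrushbot|ahrefsbot|blexbot"   (example value; yours may differ)

CrawlerDetection.allow_crawler?("Mozilla/5.0 (compatible; SemrushBot/7~bl)")
# => false when the agent matches the blocked list
```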
I recommend you do not change this, because the bots that Discourse core blocks by default do not respect robots.txt; however, it's your site, so you can do as you wish. There is a good reason they are blocked in core.
Having said that, Discourse gives you the option to “unblock” them via the site settings in the admin UI.