Why are semrushbot and ahrefsbot blocked by default?

I was checking the Google Search Console coverage report and found that lots of our forum pages are blocked by robots.txt. So I went ahead and checked the robots.txt and found that semrushbot and ahrefsbot are blocked by default:
[screenshot of the default robots.txt showing the blocked bots]

I know these are two widely used SEO tools, so why are their bots blocked?

2 Likes

Because those bots are “resource sucking bot hogs” which provide very little value to sites compared to the resources they consume.

Of course, you can customize the Discourse robots.txt file and permit them if you wish; but we were blocking these bots on our sites long before Discourse was released, and we keep them blocked.

:slight_smile:


Note (Edited):

I forgot to mention that many of these “resource sucking bot hogs” do not respect robots.txt, so they must be blocked at the HTTP User-Agent level. We block these “disrespectful resource sucking bot hogs” with mod_rewrite at the reverse proxy level, generally speaking (one of the many good reasons to run behind a reverse proxy, BTW).
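For illustration, here is a minimal mod_rewrite sketch of that kind of block at an Apache reverse proxy. The user agent list simply mirrors Discourse’s default blocked_crawler_user_agents value shown later in this thread; treat it as an example and adjust it to your needs:

# Return 403 for any request whose User-Agent matches these patterns (case-insensitive)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (mauibot|semrushbot|ahrefsbot|blexbot) [NC]
RewriteRule ^ - [F,L]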

7 Likes

Thanks so much for the information!

I found another issue and maybe you can share your insight on it as well. :slight_smile:

I know Discourse blocks user pages from crawling by default, but in my Google Search Console coverage report there are still some user pages indexed, which is an issue in Google’s eyes because those pages should not be indexed.

Thanks!

1 Like

This was fixed recently with

https://github.com/discourse/discourse/commit/13f229808a22db9e1032832a313ab701b66614c8

Can you update your Discourse and reverify?
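If it is useful, a hedged spot check after updating might look like the following. This assumes the linked commit adds a noindex signal to user profile pages (check the commit itself for the exact mechanism), and your-forum.example / some-username are placeholders for a real URL on your forum:

# Look for a noindex signal on a user page, in the response headers and in the HTML
curl -sI https://your-forum.example/u/some-username | grep -i "x-robots-tag"
curl -s https://your-forum.example/u/some-username | grep -io "noindex"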

1 Like

@osioke Thanks for your reply! I believe our installed version already includes that fix, since I noticed it was committed in January.

Could you please confirm whether I need to upgrade to the latest version to get this fix?

1 Like

It doesn’t hurt to update IMO, but yes, that fix should be in your installed version. I would try updating and reverifying unless you don’t want to update for some other reason.

3 Likes

Because they suck? They add a lot of server load for no discernible benefit, and our customers do have pageview limits on their plans.

7 Likes

Sounds good. We are updating now. Hope things will work out after the update. I’ll get back and keep you informed. :slight_smile: Thanks!

Just to clarify, is there no way to unblock semrushbot and seo spider? We need them for an SEO audit. We tried removing both from /admin/customize/robots (and also tried Allow:), but we still get a 429 error in Screaming Frog. Or is the 429 error a separate issue? Your insights are highly appreciated.

1 Like

429 errors mean that those crawlers are getting rate-limited. Discourse has some throttling enabled by default to prevent abuse. You can read more about this here.
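If you want to see what crawler throttling is in effect on your instance, here is a sketch; it assumes the slow_down_crawler_user_agents and slow_down_crawler_rate site settings present in recent Discourse versions (names and exact behavior may vary by version):

# In the rails console (see the docker exec / rails c example below)
SiteSetting.slow_down_crawler_user_agents  # crawlers subject to extra throttling
SiteSetting.slow_down_crawler_rate         # throttle interval applied to those crawlers, in seconds
# Requests beyond the configured limits are typically answered with 429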

3 Likes

Did you try this (but use your container name)?

Note: you can also configure this in the Admin UI (search the site settings for blocked_crawler_user_agents).

# docker exec -it socket-only bash
root@socket-only:/# rails c
[1] pry(main)> SiteSetting.blocked_crawler_user_agents
=> "mauibot|semrushbot|ahrefsbot|blexbot|seo spider"
[2] pry(main)> SiteSetting.blocked_crawler_user_agents = ""
=> ""
[3] pry(main)> SiteSetting.blocked_crawler_user_agents
=> ""
[4] pry(main)> 

See also:

  def self.allow_crawler?(user_agent)
    return true if SiteSetting.allowed_crawler_user_agents.blank? &&
      SiteSetting.blocked_crawler_user_agents.blank?
...

You can see from the code that if you set these two site settings to “blank” then there will be no blocking:

  • SiteSetting.allowed_crawler_user_agents
  • SiteSetting.blocked_crawler_user_agents
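As a minimal sketch (rails console again, same approach as the example above):

# With both settings blank, allow_crawler? returns true for every user agent
SiteSetting.allowed_crawler_user_agents = ""
SiteSetting.blocked_crawler_user_agents = ""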

I recommend you do not change this, because the bots that Discourse core blocks by default do not respect robots.txt; however, it’s your site, so you can do as you wish. There is a good reason they are blocked in core.

Having said that, Discourse gives you the option to “unblock” these using your SiteSettings in the UI.

3 Likes

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.