I was checking the Google Search Console coverage report and found that lots of our forum pages are blocked by robots.txt. So I went ahead and checked robots.txt and found that semrushbot and ahrefsbot are blocked by default:
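The relevant entries look roughly like this (an abridged excerpt; the exact list depends on the site's settings):

```text
User-agent: semrushbot
Disallow: /

User-agent: ahrefsbot
Disallow: /
```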
I know these are two widely used SEO tools, so why block their bots?
Because those bots are “resource sucking bot hogs” which provide very little value to sites compared to the amount of resources they consume.
Of course, you can customize the Discourse robots.txt file and permit them if you wish; but we were blocking these bots on our sites long before Discourse was released, and we keep them blocked.
Note (Edited):
I forgot to mention that many of these “resource sucking bot hogs” do not respect robots.txt and must be blocked at the HTTP User-Agent level. We block these “disrespectful resource sucking bot hogs” with mod_rewrite at the reverse proxy level, generally speaking (one of the many good reasons to run behind a reverse proxy, BTW).
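For example, with Apache and mod_rewrite in front of Discourse, a rule along these lines does the job (a minimal sketch; the agent list is illustrative, not our exact list):

```apache
# Reject requests whose User-Agent matches known crawler names (case-insensitive).
# This runs at the reverse proxy, so blocked requests never reach Discourse.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (semrushbot|ahrefsbot|mj12bot|blexbot) [NC]
RewriteRule .* - [F,L]
```

Nginx can do the equivalent with an `if ($http_user_agent ~* ...) { return 403; }` block.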
I found another issue and maybe you can share your insight on it as well.
I know Discourse blocks user pages by default, but in my Google Search Console coverage report some user pages are still indexed, which Google flags as an issue because these pages should not be indexed:
It doesn’t hurt to update IMO, but yes, that fix should be in your installed version. I would try updating and reverifying unless you don’t want to update for some other reason.
Just to clarify, is there no way to unblock SemrushBot and the SEO Spider? We need them for an SEO audit. I tried removing both from /admin/customize/robots (and also tried Allow: ), but we get a 429 error in Screaming Frog. Or is the 429 a separate issue? Your insights are highly appreciated.
429 errors mean those crawlers are being rate limited rather than blocked by robots.txt. Discourse enables some throttling by default to prevent abuse. You can read more about this here.
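If the 429s come from the container-level limits, they can usually be relaxed in the `env` section of app.yml for the duration of the audit (a sketch, assuming a standard Docker install that includes the web.ratelimited template; verify the variable names against your own templates before rebuilding):

```yaml
env:
  ## Illustrative values only; defaults and names can differ between versions.
  DISCOURSE_MAX_REQS_PER_IP_MODE: warn            # log instead of returning 429
  DISCOURSE_MAX_REQS_PER_IP_PER_MINUTE: 400
  DISCOURSE_MAX_REQS_PER_IP_PER_10_SECONDS: 100
```

Rebuild the container after editing app.yml, and revert the change once the audit is done.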
```ruby
def self.allow_crawler?(user_agent)
  return true if SiteSetting.allowed_crawler_user_agents.blank? &&
    SiteSetting.blocked_crawler_user_agents.blank?

  # ...
  # ...
end
```
You can see from the code that if you set these two site settings to “blank” then there will be no blocking:
SiteSetting.allowed_crawler_user_agents
SiteSetting.blocked_crawler_user_agents
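If you want to confirm what your own instance is doing, you can check from the rails console (a sketch; the setting value and user agent string shown are illustrative):

```ruby
# ./launcher enter app, then `rails c`
SiteSetting.blocked_crawler_user_agents
# => "mauibot|semrushbot|ahrefsbot|blexbot"   (example value; yours may differ)

CrawlerDetection.allow_crawler?("Mozilla/5.0 (compatible; SemrushBot/7~bl)")
# => false when the agent matches the blocked list
```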
I recommend you do not change this, because the bots that Discourse core blocks by default do not respect robots.txt; however, it's your site, so you can do as you wish. There is a good reason they are blocked in core.
Having said that, Discourse gives you the option to “unblock” them via the site settings in the admin UI.