How to block all crawlers but Google's


#1

About 1/3 of our traffic is from crawlers (about 250K last month). Is there a way to block these but allow Google’s crawlers?


#2

Maybe you should edit your robots.txt file with:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

Source: The Web Robots Pages
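As a sanity check on rules like these, here is a minimal, self-contained matcher — a simplified sketch, not Discourse code and not a full robots.txt parser (it ignores path patterns and multi-agent groups). Note that Google's crawler matches its product token `Googlebot`; `AnyOtherBot` below is just an arbitrary stand-in for any non-Google crawler:

```ruby
# Minimal robots.txt group matcher -- a simplified sketch for sanity-checking
# the rules above, not a full parser (no path patterns, one agent per group).

ROBOTS_TXT = <<~TXT
  User-agent: Googlebot
  Disallow:

  User-agent: *
  Disallow: /
TXT

# Parse into { agent_token => [disallowed path prefixes] }.
def parse_robots(txt)
  groups = Hash.new { |h, k| h[k] = [] }
  agent = nil
  txt.each_line do |line|
    field, _, value = line.strip.partition(":")
    value = value.strip
    case field.downcase
    when "user-agent"
      agent = value
      groups[agent] # create the group even if it has no Disallow lines
    when "disallow"
      groups[agent] << value if agent && !value.empty?
    end
  end
  groups
end

# A crawler obeys its own group if one exists, otherwise the "*" group.
def blocked_everywhere?(groups, agent)
  rules = groups.key?(agent) ? groups[agent] : groups.fetch("*", [])
  rules.include?("/")
end

groups = parse_robots(ROBOTS_TXT)
puts blocked_everywhere?(groups, "Googlebot")   # => false (allowed)
puts blocked_everywhere?(groups, "AnyOtherBot") # => true  (falls under *)
```

Because a crawler obeys only the most specific group that matches its token, Googlebot never even looks at the `User-agent: *` rules.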


#3

Thank you @SidV! That put me on track :+1:

Though now I run into my limited understanding of Discourse: where can I add this? It seems there is no robots.txt file I can edit.


(cpradio) #4

IIRC, there is an outlet you can use in a plugin. Let me see if I can find it.

Edit:

and here is an example of using it in a plugin:


Needing to edit robots.txt file - where is it?
#5

Thanks! I already found those but as a non-programmer that doesn’t help me.

Neither does the How do I install a plugin? explanation.

I guess I’m looking for an edit-discourse-101…


(cpradio) #6

Going from the Sitemap plugin, you need a similar plugin.rb file (bare bones, though):

# name: Disable Bots
# about:
# version: 1.0
# authors: whoever
# url: https://github.com/your_github_account/your_repo_name

PLUGIN_NAME = "discourse-disable-bots".freeze

enabled_site_setting :disable_bots_enabled

A config/settings.yml file

plugins:
  disable_bots_enabled:
    default: false

A config/locales/server.en.yml

en:
  site_settings:
    disable_bots_enabled: 'Enable Disable Bots'

An app/views/connectors/robots_txt_index/disable_bots.html.erb

<%- if SiteSetting.disable_bots_enabled? %>
<%- unless SiteSetting.login_required? %>
Disallow: /

User-agent: Googlebot
Disallow:
<% end %>
<% end %>

For the last one, I’m not sure whether that will work. The thought is that it appends Disallow: / under the already existing User-agent: * group, but that could be a wrong assumption, and you may have to specify the agent again, such as:

<%- if SiteSetting.disable_bots_enabled? %>
<%- unless SiteSetting.login_required? %>
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
<% end %>
<% end %>
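Either way, what matters is the file the crawler finally sees. A crawler obeys only the most specific group that matches its token and ignores the rest, so with a served file like the one below, Googlebot reads only its own group (everything allowed) while every other bot falls under * (everything disallowed). This is a hypothetical rendering, assuming the connector output lands after Discourse’s default group:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: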

#7

It’s not so easy.
If you choose that route (editing robots.txt), you will need to contact your hosting provider and add the files @cpradio described. And :warning: please take this as a warning: if you upgrade Discourse later, all of these modifications will be gone, and you will have to redo them after every update/upgrade.

Sorry for my intrusion, but why are you worried about the crawler traffic on your website?


#8

Well, the traffic from the crawlers makes us exceed our plan, so I was looking for an easy solution.

But it seems there is none, haha.

Thank you for your input though!


(Mittineague) #9

Would it be easier to just make the forum login required?
(search bots don’t log in)


(Rafael dos Santos Silva) #10

That blocks everyone, and for that we have a better-suited setting: allow index in robots txt.

He wants to block all bots except one specific bot, and that’s not easily done.