How to disable indexing by crawlers


(omfg) #1

I’m wondering about the best place to disable indexing (set noindex) for the entire site, across the board.
Is it enough to simply insert NOINDEX in the Discourse header? (Would that apply to all pages/URLs the site generates?)

P.S. As for misbehaving crawlers, such as some search engines: where is the best place to block entire countries, before Docker NAT or inside the container? It seems cheapest to drop connections with iptables before they hit the container, but I’m not sure whether that’s good practice.
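One way to do the iptables variant is to load a downloaded country CIDR list into an ipset and drop matching traffic in the DOCKER-USER chain, which is evaluated before packets are forwarded to containers. This is only a sketch: the set name and the `country.zone` filename are placeholders, and you’d need a CIDR list from a GeoIP provider.

```shell
# Sketch: block an entire country's address space before it reaches the container.
# Assumes ipset is installed and country.zone is a CIDR list downloaded from a
# GeoIP provider (hypothetical filename).
ipset create geoblock hash:net

while read -r cidr; do
  ipset add geoblock "$cidr"
done < country.zone

# The DOCKER-USER chain is consulted before traffic is forwarded to containers,
# so matching connections are dropped before they ever hit Discourse.
iptables -I DOCKER-USER -m set --match-set geoblock src -j DROP
```

These rules need root and don’t survive a reboot on their own; you’d persist the ipset and the rule with your distribution’s usual mechanism.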


(Mittineague) #2

Interesting question. I don’t think I’ve ever heard of anyone wanting to block search engines from an entire site; it’s usually the opposite.

I don’t know either, but I’d assume that any well-behaved bot would observe both a noindex, nofollow directive in page heads and an all-inclusive Disallow in robots.txt. Maybe some honor one and not the other, though?
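For reference, the two mechanisms look roughly like this (Discourse generates its own robots.txt, so on a real forum you would flip the site setting rather than edit files by hand):

```text
# robots.txt — an all-inclusive disallow for every crawler
User-agent: *
Disallow: /
```

```html
<!-- per-page directive in the <head> -->
<meta name="robots" content="noindex, nofollow">
```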

Sorry, I can’t offer anything in terms of server-side blocking. But maybe you could require registration and manual approval to keep the unwanted visitors from accessing the site.
They could still hit the site, but all they’d get is a “sorry”.


(Jeff Atwood) #3

There is a boolean site setting for this, search for “index” or “robots”.


(Michael - DiscourseHosting.com) #4

One of the most frequent use cases for this is keeping a site out of search engines before it goes live to the general public, without having to require everyone to log in just to see any content.

The setting is Security -> allow index in robots txt


#5

We are using our Discourse site as an internal knowledge base for our company (discourse.example.com). We do not want search engines to crawl this website whatsoever.

The three options I see in settings are: (screenshot)

What security settings should we change in addition to disabling the “allow index in robots txt”?


(Michael - DiscourseHosting.com) #6

Every decent search engine honors robots.txt, so that should be enough.
If telling them nicely doesn’t work, the only way to keep them out is to make your forum private or to block certain IP addresses.


(Joshua Rosenfeld) #7

I do the exact same thing. We did two things to keep our site secure:

  1. Require Login. Not everyone at the “company” (university in my case) should have access, only certain staff, so we restricted access this way.
  2. No port 80 access through the firewall. You cannot access Discourse (or anything on the specific server for that matter) from outside the company network. This includes search engines (who are already prevented by #1, but I digress). If an employee needs access offsite, they use the VPN.

#8

Excellent info - thank you, @jomaxro! We would love to do #2, but our small company works out of a co-working space (we don’t own the network). Do you have any suggestions for a workaround, or do you think that #1 would be sufficient?


(Joshua Rosenfeld) #9

#1 is absolutely sufficient. Any unauthenticated viewer (like Google) sees only the login page, which is customizable. There is no way to access any content, and thus no way to crawl anything. Adding NOINDEX should be respected as well, but even if it isn’t, the worst case is that a crawler hits the login screen.

As for #2, it’s not in place to prevent indexing of Discourse, but for overall security of the server (there are other apps besides Discourse). Company network security policy is not to open the firewall for a server unless there is a compelling reason to do so. The only use of the server is by students/employees, all of whom have VPN credentials, so the server doesn’t get firewall access.


#10

So if we’re storing a lot of sensitive information on our Discourse forum, what do you suggest we do to make it as secure as possible? (We don’t want intruders who find intranet.ourcompany.com to be able to get in and read sensitive product and strategy info.)

Should I just remove our CNAME and have everyone go directly to the IP address of the hosted server? I know these questions are a bit novice, but security is not my strong suit.


(Kane York) #11

Login required is a pretty massive lock-down. So go with login required, must approve users, invite only. With that setup, only accounts that the site staff send invites to can see anything on the site.


(David Collantes) #12

I strongly recommend you hire someone who knows security. Your questions have already been answered earlier in the thread.


#13

I was responding to that statement and requesting additional info. Read the thread.


#14

Login required is a pretty massive lock-down. So go with login required, must approve users, invite only. With that setup, only accounts that the site staff send invites to can see anything on the site.

Thanks, @riking - this is very useful. We’ll likely contract someone to help with additional security measures. Do you know whether adding SSL has any benefit against external penetration risks?


(Kane York) #15

Yes, SSL is necessary if you want to prevent network-based snooping on the data or your login info. It won’t prevent shoulder-based snooping, of course.

With a login-required forum, the risk of the forum’s domain name leaking is minimal - all an attacker learns is (a) the version of Discourse (considered non-sensitive information) and (b) the site title and login page message.

The domain name “leaking” is always going to happen anyway for anything accessible outside a segmented network. So using Let’s Encrypt would be perfectly fine (they publicly log all issued certificates, which is why I bring this up).
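For a standard discourse_docker install, that means uncommenting the two SSL templates in containers/app.yml and setting the Let’s Encrypt account email (the email below is a placeholder):

```yaml
## in containers/app.yml
templates:
  - "templates/web.ssl.template.yml"
  - "templates/web.letsencrypt.ssl.template.yml"

env:
  LETSENCRYPT_ACCOUNT_EMAIL: admin@example.com
```

After editing, run `./launcher rebuild app` from the discourse_docker directory to apply the change.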


#16

I’ll definitely move forward with Let’s Encrypt. Excellent info, thanks @riking!