There are a few crawlers on our site. Is there any risk that they have access to the content?
What is an "acceptable crawler load/risk" before I would have to undertake blocking procedures, with which I have little, if any, expertise?
They can only crawl public sites, meaning there is no security breach. But yes, they can access public content.
The load is a problem when it is so high that it has a negative impact and you would have to buy more CPU and/or RAM. I don't know how easily that can happen on Discourse, because the stack is different, but PHP-based WordPress is quite easy to bring to its knees. Discourse serves static and lightweight content to bots, as long as it knows who is human and who is not. And if a bot sends a blatantly false user agent, what can it get… a lot of text as JSON?
If a bot makes its way through the login, trust-level barriers, etc., I would guess the team would be in panic mode and every hand would be called back to work right away.
Also note that you can easily block crawlers via your admin settings.
I would be so grateful to know how…
Hopefully that isn't just editing robots.txt, because that only works with well-behaved crawlers. There is really only one effective, if somewhat more difficult, solution: a reverse proxy.
That approach is effective – we use it ourselves and recommend it to those on our hosting.
Should I read that as meaning Discourse does such filtering?
I’m not sure what you’re asking. We don’t block anything by default but we provide admins with the tools to be selective.
So you are trusting that bots will a) read robots.txt and b) follow the rules. Well, badly behaved ones do neither. And we are back where we started: if bots are any kind of issue, a reverse proxy is the best solution.
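To make that concrete, here is a minimal sketch of what such a reverse proxy rule could look like, assuming the forum sits behind Nginx; the bot names and the upstream address are placeholders for illustration, not a recommended blocklist:

```nginx
# Goes in the http {} context: classify requests by User-Agent.
# The patterns below are examples only, not a curated blocklist.
map $http_user_agent $is_bad_bot {
    default                            0;
    ~*(AhrefsBot|SemrushBot|MJ12bot)   1;
}

server {
    listen 80;
    server_name forum.example.com;   # placeholder hostname

    location / {
        # Refuse matching bots before the request ever reaches the forum
        if ($is_bad_bot) {
            return 403;
        }
        proxy_pass http://127.0.0.1:8080;   # placeholder upstream, e.g. the app container
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

Bots that lie about their user agent will still slip past a rule like this, so it is mitigation rather than a guarantee.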
Thanks. That I would like to know.
Ah, I see what you’re getting at. No, we’re not assuming that all bots identify themselves as crawlers or follow the rules – it’s definitely an inexact science. I was simply offering a first point of mitigation to the OP.
We are currently working on ways to restrict traffic more specifically, but it’s not an easy task.
I have noticed crawler numbers are much lower on a Discourse-hosted site than on a DigitalOcean server site, both with default admin settings.
The hosted site usually has fewer than ten crawlers a day, an average of about 4. Sometimes there are spikes; for example, the last day of this most recent January had 77 crawlers.
The DigitalOcean site, with almost no activity, averages about 30 crawlers a day. I don't know why there would be more crawlers there, or whether the kind of server or domain matters.
These are generally searching/indexing public sites and content so that search engines can find them, which can be a good thing if you want to reach a broader audience: people can find your site if they search for something being talked about on a Discourse site.
There may be other purposes for crawlers; I don't know everything they are used for. Some are denied access by default in the settings, which you probably know about already.
Being relatively computer illiterate, I have been following your expert opinions on crawling somewhat like a handicapped spectator watching the final game of the US Open… Thanks for introducing me to this puzzling part of site security.
Our forum, so efficiently hosted by Discourse, is a highly confidential one. Users joining by invitation are very nervous about confidentiality, and I am trying to reassure them as best I can. Crawlers may not be too harmful (?!), but I would like to keep them off completely if possible; they are of no use to us, as we have no interest in our content being indexed or known in any way.
I now realize that optimizing settings is the first thing to do. Would it be possible to have my settings examined by someone in the Communiteq support group in that respect?
Thanks for your attention.
Ah, that’s good to see, I thought it just relied on Redis to more quickly serve up recently rendered content. As you mention, when my forum was running on Drupal the bad bots and sometimes even the search engine crawlers would occasionally bring it to its knees. But I installed a plugin that created a static HTML file cache of anonymously accessed pages and automatically created Nginx rewrite rules for them. Nginx would serve those without bootstrapping the Drupal PHP code and it was just blistering fast and could handle way more anonymous traffic load.
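For anyone curious, the idea looks roughly like this in Nginx terms. This is a simplified sketch using try_files rather than the generated rewrite rules that plugin produced, and the paths and backend address are made up:

```nginx
server {
    listen 80;
    server_name forum.example.com;   # placeholder hostname
    root /var/www/html;              # placeholder document root

    location / {
        # Serve a pre-generated static HTML copy if one exists,
        # otherwise fall through to the dynamic application.
        try_files /cache$uri/index.html $uri @app;
    }

    location @app {
        proxy_pass http://127.0.0.1:8080;   # placeholder application backend
        proxy_set_header Host $host;
    }
}
```

Because the cached pages are plain files, Nginx can hand them out without touching the application at all, which is why this kind of setup holds up so well under anonymous and crawler traffic.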
Hey there. It’s pretty important to note that this has no security ramifications. Crawlers only have access to public sites. If you have a login only site they won’t have access.
One other clarification is that Communiteq is not affiliated in any way with us so if they are your hosts you are not hosted by Discourse.
I planned to send a private response but this might be helpful to others as well so I’m posting it here.
They are accessing your home (login) page only and they’re not able to access the content.
They can be. Depending on the type of crawler, it could be making information accessible that you didn't want to be accessible. Technically speaking, a crawler can only access public information, but a crawler (and its associated search engines) are very good at discovering information and making it accessible.
So let’s take a look at your situation.
Your robots.txt shows

```
User-agent: *
Disallow: /
```

so it's set to deny all search engine crawlers.
However, this alone is not enough, since robots.txt is based on politeness and is not honored by "bad" robots. A bad robot can simply choose to ignore robots.txt. It's like a "keep out!" sign - a burglar will not honor it.
The main security of your forum is based on the fact that you have login required enabled. That's enough to keep any crawler out.
Although we have already determined that crawlers are not able to get in, it might be good to take this a step further.
You also have invite only and allow new registrations enabled, and invite allowed groups is set to TL2. This means that arbitrary people cannot sign up, but any user at TL2 or higher will be able to invite other users to the community. As a safety net you have enabled must approve users, so that's good. The only way to gain access to your community is to get invited by someone who's already a trusted member of the community, and an admin needs to let you in.