Seeing anonymous user and crawler traffic, even though the site is private

I help run a private Discourse instance, and I’ve noticed some anonymous user and web crawler traffic recorded in my dashboard. Looking more closely, I see it was happening before too, just in smaller amounts.

I have the “login required” option enabled, and our SSO is set up to only allow logins for users who meet certain criteria. Is there another setting I should be enabling? Thanks! : )

2 Likes

There shouldn’t be anything additional you need to do… that crawler traffic is likely from crawlers hitting community.example.com/login. If you check community.example.com/admin/reports/web_crawlers, you can see how often specific crawlers are hitting your site.

There are a couple things you can do to reduce the crawler traffic…

  • Try disallowing /login from crawlers within robots.txt (community.example.com/admin/customize/robots)… you’d probably see some crawler traffic drop (though probably not completely, since some crawlers out there don’t obey robots.txt)

  • Take a look at the worst offenders in /admin/reports/web_crawlers and add their user agents to the blocked crawler user agents site setting
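The robots.txt override from the first bullet might look something like this (a minimal sketch; Discourse generates its own robots.txt, and the admin page above lets you override it):

```
# Ask well-behaved crawlers to skip the login page
User-agent: *
Disallow: /login
```

Keep in mind robots.txt is advisory only: crawlers that ignore it will still hit the page, which is where the blocked crawler user agents setting in the second bullet comes in.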

6 Likes

In addition to what Kris wrote, there will also be an anonymous request for your site’s login page or home page at the beginning of each SSO login request.

Your site’s TOS and Privacy pages can also probably be accessed by anonymous users.

3 Likes

If you are on a VPS, or you have Nginx (Apache works too, but Nginx is easier :wink: ) in front of Discourse, banning bots is much easier. The UI of Discourse is… not so easy to use for this, because there are plenty of bots out there. Robots.txt is close to useless because very few bots follow it, not even Google.
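As a rough illustration, blocking bots by user agent at the Nginx layer might look like this (a sketch only; the user agent names are examples of common SEO crawlers, and your own list should come from your logs or the web_crawlers report):

```
# http {} level: classify requests by user agent
map $http_user_agent $blocked_bot {
    default         0;
    ~*SemrushBot    1;   # example entries; build your own list
    ~*AhrefsBot     1;   # from your access logs or
    ~*MJ12bot       1;   # /admin/reports/web_crawlers
}

# inside the server {} block that proxies to Discourse
server {
    # ... existing Discourse proxy configuration ...
    if ($blocked_bot) {
        return 444;      # nginx-specific: close the connection with no response
    }
}
```

This way the request is rejected before it ever reaches Discourse, so the app spends no cycles on it.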

The issue is not bots trying to reach your Discourse itself. Everything else they are looking for is:

  • Hundreds of script kiddies are testing whether you have WordPress and probing for holes, mostly old ones, but still
  • SEO scrapers and other spiders are trying to analyze your content, mostly because they want to make money with it
  • plus, of course, search engines

Those don’t do any real harm in the sense of breaking in, but serving them costs pure money.

The problem is that your server must answer all of them. Quite soon the majority of the load is coming from bots, not real users. It is a totally normal situation to see around 50 to 500 bots per actual user.

And you will pay for all of this.

I don’t have a global audience, because my sites, including Discourse, are purely Finnish. So I have one more powerful tool, but it can be used only on a VPS: geo-blocking.

I’m so sorry to our friends from Russia, China, India, Pakistan, Iran, Iraq and Viet Nam, but when I blocked those countries, my bot load dropped by about 90%.
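Geo-blocking in Nginx can be sketched with the GeoIP2 module (assumptions: ngx_http_geoip2_module is installed, and a MaxMind GeoLite2 country database sits at the path shown; the country list here just mirrors the ones mentioned above):

```
# http {} level: look up the client's country code
geoip2 /etc/nginx/GeoLite2-Country.mmdb {
    $geoip2_country_code country iso_code;
}

map $geoip2_country_code $geo_blocked {
    default 0;
    RU      1;
    CN      1;
    IN      1;
    PK      1;
    IR      1;
    IQ      1;
    VN      1;
}

# inside the Discourse server {} block
server {
    # ... existing Discourse proxy configuration ...
    if ($geo_blocked) {
        return 403;
    }
}
```

Obviously this only makes sense when, like here, your real audience comes from a small set of countries.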

Fighting against bots is a never-ending struggle. And the tools of Discourse, when a forum is not private, are very limited. But sure, better than nothing.

Don’t get me wrong. I’m not saying an app should do work that is the server’s job. I just mean that you can’t rely on Discourse alone.

3 Likes