Smarter handling of random crawler traffic

Following:

https://github.com/discourse/discourse/commit/7b562d2f46c60df5323ac06731cf341d95d85027

We now have much improved crawler/bot detection. In the past we have seen many customers complain that pageviews in Google Analytics are way off from pageviews in Discourse. This change splits up the traffic much more accurately, so the two reports are more in line.

While analysing this problem I noticed that we are allowing an enormous amount of crawling from many sources that are probably pointless.

https://gist.github.com/SamSaffron/6cfad7ea3e6df321ffb7a84f93720a53

DotBot is using up more traffic than Google, Bing is on a rampage, and weird things like magpie crawler eat tons of traffic.

This is all traffic that forum operators pay for, and they usually get very little value out of it.

I think we should consider:

  1. Making a very easy dropdown in site settings with:

Allowed Crawler Traffic:

  • Strict: robots.txt blocks everything except a list of common crawlers; additional rate limits and blocks are in place to ensure this is enforced

  • Open: crawler traffic is limited only by global rate limits (default)

  2. Adding a site setting that lists the crawlers you want to allow:

allowed_crawler_user_agents:

Only used when “strict” crawler traffic is enforced: a list of user agents (with optional wildcards) that are allowed. (See the sketch after this list.)
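To make the wildcard idea concrete, here is a minimal sketch (purely illustrative; the setting name, wildcard syntax and example patterns are assumptions, not the final design) of matching an incoming User-Agent against such a list:

```python
# Illustrative only: matching a User-Agent against an allowed list with wildcards.
import fnmatch

def crawler_allowed(user_agent: str, allowed_patterns: list[str]) -> bool:
    """Return True if the user agent matches any allowed pattern."""
    ua = user_agent.lower()
    return any(fnmatch.fnmatch(ua, pattern.lower()) for pattern in allowed_patterns)

# Hypothetical allowed list for "strict" mode:
allowed = ["*googlebot*", "*bingbot*"]

print(crawler_allowed("Mozilla/5.0 (compatible; Googlebot/2.1)", allowed))  # True
print(crawler_allowed("Mozilla/5.0 (compatible; DotBot/1.1)", allowed))     # False
```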

We have to make sure our “strict” mode allows some bot traffic through; it is unlikely you want to disable oneboxing of your forum from every source on the web, but in strict mode you would want to heavily throttle it.
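As a rough sketch of the sort of per-user-agent throttle “strict” mode could apply to bot traffic that is still allowed through (the window size and request limit below are made-up numbers, not proposed defaults):

```python
# Illustrative fixed-window throttle keyed on the User-Agent string.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 10

_recent = defaultdict(deque)  # user agent -> timestamps of recent requests

def throttled(user_agent: str) -> bool:
    """Return True if this user agent has exceeded its request budget."""
    now = time.monotonic()
    hits = _recent[user_agent]
    # Drop requests that have fallen out of the window.
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    if len(hits) >= MAX_REQUESTS_PER_WINDOW:
        return True
    hits.append(now)
    return False
```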

A lot of the planning for this change needs to be around how “strict” mode works.

The open “bots vs crawlers” question

At the moment we use the term crawler to mean “very likely not a human using a browser”. It encompasses bots like wget and curl as well as crawlers like Bing and Google.

There is an open question as to whether we should split the “crawler” bucket into two, but I am unsure here.
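Purely as an illustration of what that split could look like (the substring lists below are examples, not our actual detection rules):

```python
# Illustrative only: one way to split the single "crawler" bucket into
# "bot" (scripted clients) and "crawler" (search engine spiders).
CRAWLER_HINTS = ("googlebot", "bingbot", "duckduckbot", "yandexbot")
BOT_HINTS = ("wget", "curl", "python-requests", "libwww")

def classify(user_agent: str) -> str:
    ua = user_agent.lower()
    if any(hint in ua for hint in CRAWLER_HINTS):
        return "crawler"
    if any(hint in ua for hint in BOT_HINTS):
        return "bot"
    return "browser"

print(classify("curl/7.58.0"))                            # bot
print(classify("Mozilla/5.0 (compatible; bingbot/2.0)"))  # crawler
```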


This is now complete, thanks to @neil!

You can see “top crawlers” at https://yoursite/admin/reports/page_view_crawler_reqs

You can blacklist bad crawlers by setting blacklisted crawler user agents.

Alternatively, if you wish to allow only particular crawlers, you can set the whitelist via the whitelisted crawler user agents setting.
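As a rough illustration of the intended semantics (this sketch assumes simple substring matching, and the precedence shown when both settings are populated is an assumption, not a statement of exactly how Discourse resolves it):

```python
# Illustrative only: blacklist vs whitelist semantics for crawler user agents.
def crawler_blocked(user_agent: str, whitelist: list[str], blacklist: list[str]) -> bool:
    ua = user_agent.lower()
    if whitelist:
        # When a whitelist is present, anything not on it is blocked.
        return not any(entry.lower() in ua for entry in whitelist)
    # Otherwise, only explicitly blacklisted agents are blocked.
    return any(entry.lower() in ua for entry in blacklist)

print(crawler_blocked("DotBot/1.1", whitelist=[], blacklist=["dotbot"]))        # True
print(crawler_blocked("Googlebot/2.1", whitelist=["googlebot"], blacklist=[]))  # False
```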
