Following:
https://github.com/discourse/discourse/commit/7b562d2f46c60df5323ac06731cf341d95d85027
We now have much improved crawler/bot detection. In the past we have seen many customers complain that pageviews in Google Analytics are way off from pageviews in Discourse. This change splits up the traffic much more accurately, so the two reports are more in line.
While analysing this problem I noticed that we are allowing a gigantic amount of crawling from many sources that are probably of very little value:
https://gist.github.com/SamSaffron/6cfad7ea3e6df321ffb7a84f93720a53
DotBot is using up more traffic than Google, Bing is on a rampage, and weird things like magpie-crawler eat tons of traffic.
This is all traffic that forum operators pay for, and they usually get very little value out of it.
I think we should consider:
- Making a very easy dropdown in site settings with:
  Allowed Crawler Traffic:
  - Strict: robots.txt blocks everything but a list of common crawlers; additional rate limits and blocks are in place to ensure this is enforced
  - Open: crawler traffic is only limited via global rate limits (default)
- Add a site setting that lists the crawlers you want to allow:
  Allowed_crawler_user_agents: only used if "strict" crawler traffic is enforced; a list of user agents (with optional wildcards) which are allowed. A rough sketch of how the two settings could fit together follows below.
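
To make this concrete, here is a minimal sketch in plain Ruby (not Discourse code; the setting value and helper names are made up for illustration) of how a strict-mode robots.txt could be generated from such a list, and how a wildcarded allowlist could be checked against an incoming user agent, since robots.txt alone is only advisory:

```ruby
# Illustrative only: not the shipped implementation.
# Hypothetical value for the proposed Allowed_crawler_user_agents setting,
# using * as a wildcard.
allowed_crawler_user_agents = ["Googlebot*", "bingbot*"]

# In "strict" mode robots.txt allows only the listed crawlers and blocks
# everyone else outright ("Disallow:" with an empty value means allow all).
def strict_robots_txt(allowed_agents)
  rules = allowed_agents.map { |agent| "User-agent: #{agent.delete('*')}\nDisallow:\n" }
  rules << "User-agent: *\nDisallow: /\n"
  rules.join("\n")
end

# Wildcard match used to enforce the same list at request time, because
# badly behaved crawlers ignore robots.txt.
def allowed_crawler?(user_agent, allowed_agents)
  allowed_agents.any? do |pattern|
    File.fnmatch?("*#{pattern}*", user_agent.to_s, File::FNM_CASEFOLD)
  end
end

puts strict_robots_txt(allowed_crawler_user_agents)
allowed_crawler?("Mozilla/5.0 (compatible; Googlebot/2.1)", allowed_crawler_user_agents) # => true
allowed_crawler?("Mozilla/5.0 (compatible; DotBot/1.1)", allowed_crawler_user_agents)    # => false
```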
We have to make sure our "strict" mode allows some bot traffic through; it is unlikely you want to stop every other site on the web from oneboxing links to your forum, but in strict mode you would want to throttle that traffic heavily.
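
As a sketch of what "heavily throttle" could mean, something shaped like the fixed-window limiter below could sit in front of bot requests in strict mode; the class name and the 30 requests per minute figure are assumptions for illustration, and real enforcement would more likely reuse Discourse's existing rate limiting:

```ruby
# Illustrative sketch only: a per-user-agent fixed-window throttle.
class CrawlerThrottle
  def initialize(max_requests:, window_seconds:)
    @max_requests = max_requests
    @window_seconds = window_seconds
    @hits = Hash.new { |h, k| h[k] = [] }
  end

  # Returns true if the request may proceed, false if it should get a 429.
  def allow?(user_agent)
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    bucket = @hits[user_agent]
    bucket.reject! { |t| t < now - @window_seconds }
    return false if bucket.size >= @max_requests
    bucket << now
    true
  end
end

# e.g. still allow onebox-style fetches of forum pages, but only a trickle
throttle = CrawlerThrottle.new(max_requests: 30, window_seconds: 60)
throttle.allow?("DotBot/1.1") # => true until DotBot exceeds 30 requests/minute
```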
A lot of the planning for this change needs to be around how "strict" mode works.
The open bots vs. crawlers question
At the moment we use the term crawler to mean "very likely not a human using a browser". It encompasses bots like wget and curl as well as crawlers like Bing and Google.
There is an open question as to whether we should split the "crawler" bucket into two, but I am unsure here.
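
If we did split the bucket, the classification might look roughly like the sketch below; both pattern lists are assumptions for illustration rather than a proposed taxonomy:

```ruby
# Illustrative only: splitting the current "crawler" bucket into
# search-style crawlers and generic bots/tools.
SEARCH_CRAWLERS = /Googlebot|bingbot|Baiduspider|YandexBot|DotBot|magpie-crawler/i
GENERIC_BOTS    = /curl|wget|python-requests|Go-http-client|libwww/i

def traffic_bucket(user_agent)
  ua = user_agent.to_s
  return :crawler if SEARCH_CRAWLERS.match?(ua)
  return :bot     if GENERIC_BOTS.match?(ua)
  :browser
end

traffic_bucket("Mozilla/5.0 (compatible; bingbot/2.0)") # => :crawler
traffic_bucket("curl/7.58.0")                           # => :bot
```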