So I noticed something funny between the 15th and the 16th of February 2022 in our self-hosted DigitalOcean Discourse instance. As the picture below shows, the number of anonymous users jumped from roughly 1 000 per day to an average of 10 000 per day. I have tried to find the cause, but to no avail.
I cannot seem to match it with the views on our content or with the stats from Google Search Console or Google Analytics. We also checked the logs, but didn't find much.
Does anybody have an idea what could be causing this?
That would be tricky, if not near impossible, to answer in a way that speaks to your specific site. What you can do to start figuring it out, though, is look at the crawler report in your dashboard to see whether crawlers are causing it.
Also, I edited your topic title to make it more descriptive.
The entity making the request is the one that identifies itself as either a “normal” user or a bot. It’s an honor-based system, with all the ups and downs that come with that.
Most bad actors in the bot ecosystem won’t identify as such and will issue requests disguised as “normal” users, and there is not much Discourse can do in those cases.
If you are comfortable with the command line, you can log into your server and use the following to track where most requests are coming from:
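One way to do this kind of per-IP tally is sketched below in Python. It is only an illustration under stated assumptions: it assumes an nginx-style access log where the client IP is the first whitespace-separated field, and the function name `top_client_ips` and the sample lines are made up for the example.

```python
from collections import Counter


def top_client_ips(log_lines, n=10):
    """Return the n most frequent client IPs as (ip, count) pairs.

    Assumption: the client IP is the first whitespace-separated field
    of each line, as in nginx's default "combined" log format.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts.most_common(n)


# Made-up sample lines standing in for a real access log.
sample = [
    '203.0.113.7 - - [16/Feb/2022:10:00:00 +0000] "GET /latest HTTP/1.1" 200 512',
    '198.51.100.4 - - [16/Feb/2022:10:00:01 +0000] "GET /t/topic/1 HTTP/1.1" 200 900',
    '203.0.113.7 - - [16/Feb/2022:10:00:02 +0000] "GET /categories HTTP/1.1" 200 700',
]

for ip, hits in top_client_ips(sample):
    print(f"{hits:6d}  {ip}")
```

To run it against a real log, pass an open file object instead of `sample`. Where that log lives depends on your install.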
Of course, nothing more is needed than for a bot to identify itself as a user. Changing a user agent is a really trivial thing; even your browser can do it. And Discourse only knows the bots that are using… well, a known UA.
Sure, those can be real users too, if some higher-traffic site links to you.
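To make the point concrete, here is a toy sketch of user-agent-based classification. It is not Discourse's actual detection code; the token list and the function name `looks_like_crawler` are assumptions made up for illustration. It only shows why a bot that sends a browser-like UA ends up counted as a regular visitor.

```python
# Illustrative token list only; a real crawler list would be much longer.
KNOWN_CRAWLER_TOKENS = ("googlebot", "bingbot", "duckduckbot")


def looks_like_crawler(user_agent: str) -> bool:
    """Classify a request purely by what its User-Agent header claims."""
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_CRAWLER_TOKENS)


# A crawler that identifies itself honestly.
honest_bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# A bot that simply copies a browser's User-Agent string.
disguised_bot = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
)

print(looks_like_crawler(honest_bot))     # classified as a crawler
print(looks_like_crawler(disguised_bot))  # indistinguishable from a browser
```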
My guess is that the PDF uploaded there is something that got linked elsewhere and lots of people are downloading it directly? Is that PDF something that got uploaded by a bad actor and is getting lots of traffic for some reason?
Thanks, @pfaffman, but there’s no problem with the PDF; I actually uploaded it myself. I was just showing the picture to indicate that there’s no correlation with the thousands of anonymous users Discourse is showing.
The command line you provided has helped us trace the IPs responsible for the jump. For now, we are going to continue our observation before deciding if we want to block the crawlers.
Just to note, in my case the great majority of accesses are POSTs to a message-bus endpoint. In other words, probably users’ browsers. In one case they arrive every minute, and in another case much more often.
Those are indeed most of the requests on any Discourse site, but they aren’t counted as pageviews, so they won’t be reflected in the “Consolidated Pageviews” graph on the dashboard, making this a bit off-topic.
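If you want to see how much of the raw log volume is message-bus polling versus everything else, a rough sketch like the one below can separate the two. It assumes a combined-format access log and that message-bus requests use paths starting with `/message-bus/`; the function name `split_message_bus` and the sample lines are made up for the example.

```python
import re
from collections import Counter

# Matches the quoted request part of a log line,
# e.g. "POST /message-bus/abc/poll HTTP/1.1".
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*"')


def split_message_bus(log_lines):
    """Count message-bus requests separately from all other requests."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match is None:
            counts["unparsed"] += 1
        elif match.group("path").startswith("/message-bus/"):
            counts["message_bus"] += 1
        else:
            counts["other"] += 1
    return counts


# Made-up sample lines standing in for a real access log.
sample = [
    '203.0.113.7 - - [16/Feb/2022:10:00:00 +0000] "POST /message-bus/abc/poll HTTP/1.1" 200 5',
    '203.0.113.7 - - [16/Feb/2022:10:00:30 +0000] "POST /message-bus/abc/poll HTTP/1.1" 200 5',
    '198.51.100.4 - - [16/Feb/2022:10:00:31 +0000] "GET /t/topic/1 HTTP/1.1" 200 900',
]

print(dict(split_message_bus(sample)))
```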