High load due to peak anonymous sessions, increase unicorn workers?

We just had peak traffic of approx 1,500 concurrent (mostly anonymous) users visiting a single page.

The forum then went into the mode where the high-load warning is displayed to all members.

CPU-Optimized DigitalOcean Droplet

Dedicated CPU: 4 vCPUs
RAM: 8 GB

Unicorn workers: 10

Given that only approx. 50% of RAM and CPU are utilized, would it help to increase the Unicorn workers for such peak-traffic cases from anonymous visitors?

Yes, increasing unicorns is the first step here.
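
On a standard Docker-based install that is a one-line change; a minimal sketch, assuming the default /var/discourse layout:

```yaml
# /var/discourse/containers/app.yml
env:
  # Each extra worker costs RAM, so leave headroom (see the note on swapping below).
  UNICORN_WORKERS: 24
```

followed by a rebuild to apply it:

```bash
cd /var/discourse
./launcher rebuild app
```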

I have increased the workers to 24. No difference: with a similar concurrent-visitor peak (99% anonymous) just now, the forum still goes into “Due to extreme load, this is temporarily being shown to everyone as a logged-out user would see it.”

I know @sam has spent a lot of time on this recently and might have commentary?

@sam Any ideas on how to further optimize for peak anonymous traffic (e.g. if a single topic goes viral on social media)? In both cases outlined above, memory and CPU still had plenty of room (according to DigitalOcean), and we had not even hit a load of 4; still, the forum went into “extreme load” mode, despite having tripled the number of workers.

Just went into “extreme load mode” again, with only 600 concurrent visitors total (99% anonymous) and a load average below 1.
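
If I understand the mechanism correctly, this fallback is triggered by request queue time rather than by average CPU, which would explain why it trips while the load average looks fine. Purely as a sketch, the relevant global settings could be raised in app.yml; the setting names and values below are from memory, so verify them against your Discourse version before relying on this:

```yaml
# /var/discourse/containers/app.yml -- ASSUMED setting names, please verify.
env:
  # Queue time (in seconds) a request must exceed to count as "slow".
  DISCOURSE_FORCE_ANONYMOUS_MIN_QUEUE_SECONDS: 2
  # How many slow requests per 10 seconds trigger the anonymous fallback.
  DISCOURSE_FORCE_ANONYMOUS_MIN_PER_10_SECONDS: 5
```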

You need to collect some data so we know what the bottleneck is.

Prometheus exporter plugin for Discourse
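
For reference, installing it follows the usual plugin pattern, with the official repository URL in the stock plugin hook:

```yaml
# /var/discourse/containers/app.yml
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://github.com/discourse/discourse-prometheus.git
```

then rebuild with `./launcher rebuild app` as usual.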

I believe the DO monitoring is not sensitive enough and is somewhat misleading. I experimented with extreme load on both Hetzner and DigitalOcean. On Hetzner, when the extreme-load message came up, there was a short, sharp peak where CPU would go to 120%.

It lasted maybe a second before dropping down to the 40-50% mark.

I recreated the same thing on DigitalOcean, and from memory CPU usage never appeared to get above 50% (but you could not change the x-axis to second-level resolution).

My guess is that the DO CPU figure is an average over 5 or 15 seconds, so you don't see the short, sharp peaks.
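
One way to test that theory without any plugin is to sample CPU at one-second resolution during a peak; a minimal sketch using sysstat, assuming `sar` is installed on the host:

```bash
# Sample overall CPU once per second for two minutes; short spikes that a
# 5- or 15-second average would flatten out show up as individual lines here.
sar -u 1 120
```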

We are going to need Prometheus exporter reports to look any deeper.

If you have the RAM and the CPU … you can always add more Unicorn workers; that will scale up for these peaks. You just don't want to start swapping memory, because performance will go way down.
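
To put a number on that headroom, it can help to check what the current workers cost; a rough sketch, run inside the container (`./launcher enter app`), assuming the worker processes have "unicorn" in their command line:

```bash
# Sum resident memory of all unicorn processes and report the per-worker
# average, to estimate how many more workers fit before swapping starts.
ps -eo rss,args | grep '[u]nicorn' \
  | awk '{sum += $1; n++} END {printf "%d procs, avg %.0f MB, total %.0f MB\n", n, sum/n/1024, sum/1024}'
```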

It seems like, in such a case, that single topic page should be cacheable and served statically for a short period without hitting the back end at all. I have no idea whether Discourse can do that (i.e. set cache-control headers when under load and serving content to anonymous users), or whether the DO setup has a capable caching proxy in the chain, but it's a development idea that might be worth a thought, if I'm not totally wrong and it isn't already done.

Maybe @sam has already thought of or done this, or knows why it is a bad idea!

That already happens dynamically under measured load on a per-topic basis; that's exactly what

… is referring to. It’s READ ONLY though, so people can’t actually have conversations in that mode.

Yep, but my suggestion is to boot just the anonymous users to a cached page with a short timeout (60s?) to take their load off, in the hope that the rest of the site can keep going in read-write mode.
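
For illustration, a proxy-level version of that idea might look like an NGINX microcache that only applies when no login cookie is present; this is just a sketch, assuming Discourse's `_t` auth cookie marks logged-in users, not a description of what Discourse does internally:

```nginx
proxy_cache_path /var/cache/nginx/anon keys_zone=anon:10m max_size=200m inactive=120s;

server {
  listen 80;
  location / {
    proxy_pass http://discourse;        # assumed upstream name
    proxy_cache anon;
    proxy_cache_valid 200 60s;          # anonymous hits served from cache for 60s
    # Requests carrying the auth cookie always bypass and never populate the cache.
    proxy_cache_bypass $cookie__t;
    proxy_no_cache $cookie__t;
  }
}
```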

That would be great. Currently, if we feature a topic on our 200,000+ member Telegram channel, it puts the entire Discourse site into “read-only” mode for almost an hour, even though there are only around 50 logged-in users (99% is anonymous traffic).

This already happens; we have pretty aggressive caching directly in Redis for anonymous users on topic-list pages and topic pages, with a 60-second timeout.

I will try to get Prometheus running in order to find the bottleneck, but it is probably DO's monitoring that is lagging, as mentioned by @Alec. If that is the case, I assume a larger machine is the way forward?
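
A minimal scrape config for that, assuming the plugin's default exporter port (9405, from memory; check the plugin README) is reachable from the Prometheus host:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: discourse
    static_configs:
      - targets: ['your-forum-host:9405']   # hypothetical hostname
```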