Could this indicate UNICORN_WORKERS being set too high/low?
The server has 64GB RAM (usually around 40GB free) and 6 cores, and there are 4 Discourse instances on the server, each set to UNICORN_WORKERS: 8.
Any ideas or tips on what’s causing it, or what to try? (One of the forums is in read-only mode and doesn’t get much traffic; should it be set to have fewer workers?)
Thanks for the replies everyone - I’m not sure where I read it now, but I always thought we were supposed to set 2 workers per core. I’ve dropped the workers down per forum, allocating more to the busiest forums and fewer to the quieter ones. I’ll monitor things over the next week and report back if it hasn’t helped.
In your case you aren’t allocating two workers per core, though. You have six cores, which would mean twelve workers, but you have four instances each using eight workers, so 32 in total.
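If it helps, here’s a rough way to total up what’s configured - just a sketch, assuming a standard /var/discourse install, so the path and grep pattern may need adjusting:
# Hypothetical check: sum UNICORN_WORKERS across all containers
# and compare the total against 2 * CPU cores on the host.
total=0
for f in /var/discourse/containers/*.yml; do
  w=$(grep -oE 'UNICORN_WORKERS: *[0-9]+' "$f" | grep -oE '[0-9]+')
  total=$(( total + ${w:-0} ))
done
echo "configured workers: $total, 2 * cores: $(( 2 * $(nproc) ))"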
Yep… I’ve adjusted things so the total number of workers is no greater than twice the number of cores. I still wonder, though - what’s the correct/standard advice: what you said, or what was in Nate’s post, where he quotes Jeff saying 1 worker per core?
From my own experiments, 1 worker per core results in timeouts (but lowers server load), while more workers result in better performance but higher load (which on my server is still within an acceptable range).
Take a look at discourse-setup, which handles the scaling for new installs today:
# UNICORN_WORKERS: 2 * GB for 2GB or less, or 2 * CPU, max 8
if [ "$avail_gb" -le "2" ]
then
unicorn_workers=$(( 2 * $avail_gb ))
else
unicorn_workers=$(( 2 * $avail_cores ))
fi
unicorn_workers=$(( unicorn_workers < 8 ? unicorn_workers : 8 ))
That second branch, using double the number of available cores, is the default on systems with more than 2GB RAM. It looks as though your issue is more down to a tug-of-war between your instances over host resources than a Discourse problem.
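To make that concrete, here’s the same formula plugged in with your numbers (64 GB RAM, 6 cores), purely as an illustration:
# The discourse-setup formula applied to a 64 GB / 6-core host:
avail_gb=64
avail_cores=6
if [ "$avail_gb" -le "2" ]
then
  unicorn_workers=$(( 2 * avail_gb ))
else
  unicorn_workers=$(( 2 * avail_cores ))    # 12
fi
unicorn_workers=$(( unicorn_workers < 8 ? unicorn_workers : 8 ))
echo "$unicorn_workers"    # 8 per instance; but 4 instances * 8 = 32 workers on 6 cores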
I’m seeing the same thing after my last upgrade, which was one day after the OP, so I don’t think this has anything to do with the number of unicorn workers. The unicorn.conf.r* process is suspicious, because the original post of this topic is the only hit for that term on the entire web. I believe unicorn.conf.rb would be more normal.
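One way to see what’s actually behind that truncated name is to look at the full command lines, since top often cuts them off - just a generic check, nothing Discourse-specific:
# List the busiest processes matching "unicorn" with their full command lines;
# truncation in top/ps output would explain seeing "unicorn.conf.r*".
ps -eo pcpu,args --sort=-pcpu | grep -i unicorn | grep -v grep | head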
I’ve just upgraded to the latest Discourse and haven’t seen any more unicorn.conf.r* (now anything around the 80% CPU mark is just ruby, and it seems less frequent). Loads are around the same (though lower than they were after I made those worker adjustments).
Have you upgraded to the latest version? What kind of hardware are you on, and how busy is your forum?
Yes, I’m at 3.4.0.beta4-dev. That’s what started the high CPU usage. Nothing else changed.
8 GB RAM, 2 vCPUs, 160 GB SSD with plenty of space.
I posted the CPU usage above for my production site, which has around 30 users online at a time. But I have a test site with the same issue and there is absolutely no traffic and no plugins there. CPU usage before and after updating (spikes are daily backups):
I’m not sure whether our situations are related, Mark. I think in my case what Stephen said played a large part:
I recently moved two other instances onto the same server and had actually forgotten that the unicorn workers were set to 8, because previously we were on a server with more cores (but it had its own problems, hence we moved back to a Xeon which had fewer cores but performed better overall).
So what I found was that reducing the unicorn workers on this server reduced load but started giving us timeouts, while increasing them eradicated the timeouts but resulted in a higher load - though still within an acceptable range. I think I could increase the workers further and we could still handle the extra load, but what we have now is good enough for the moment.
Having said that, I had moved the instances onto the same server and it was running within what I would have expected (so load increased, but not by a huge amount), and it did feel as though an update resulted in higher loads… However, I can’t be sure of that, and we have to keep in mind that as Discourse gains more features it may require more powerful hardware or sometimes feel ‘slower’ (I had some Discourse instances on old versions and they felt noticeably snappier - though of course they didn’t have all the features of the newer versions).
Having said that as well, I think loads have actually decreased a little since the latest Discourse update (with PG 15).
I’m not sure what to suggest for you, Mark - maybe play around with the workers and some of the other settings too, such as db_shared_buffers and db_work_mem (see the quick check below)? Perhaps start a dedicated thread along the lines of “High CPU usage after update - does my instance need perf tweaks?”, or something like that.
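A quick, hypothetical way to see what those are currently set to across containers, assuming a standard /var/discourse install:
# Show the current worker and Postgres memory settings for each container.
for f in /var/discourse/containers/*.yml; do
  echo "== $f"
  grep -E 'UNICORN_WORKERS|db_shared_buffers|db_work_mem' "$f"
done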
I upgraded tonight and immediately saw a difference in CPU usage on my site. Here is a graph of before, during, and after the upgrade. This represents a one-hour duration.
So it looks like my issue wasn’t the unicorn workers after all - after @sam’s update following @LotusJeff’s thread, the server loads have gone back to what they were (less than half of what they had gone up to)…
I probably wouldn’t have noticed if I hadn’t been keeping an eye on the server after recently moving the other two forums onto it - I wonder how many people it affected without them even realising?
Does the Discourse team have measures in place to alert them to issues like this? Perhaps a volunteer program that admins can opt into for specific topics, e.g. “Send server loads to Discourse within XX hours/days/weeks before/after an upgrade”. Or better still, track these locally and then alert admins when server load increases are noticed after upgrades - which we can then post here if need be…
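Even something as simple as logging the load average on a schedule would do for a first pass - a rough sketch (the log path is made up, and you’d run it from cron):
# Sketch of the local-tracking idea: append a timestamped load average to a log
# (e.g. every 5 minutes via cron), then compare values from before and after an upgrade.
echo "$(date -Is) $(cat /proc/loadavg)" >> /var/log/discourse-loadavg.log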
I probably would not have noticed the impact, but I am monitoring the server closely because we migrated to Discourse about 2 weeks ago. I am in the weeds doing various post-migration validations (backup run, etc.). After a couple of months, I would never have noticed the impact.
I would hope that Discourse has a daily load test running. In my past life, I had a server that would be rebuilt daily with committed code. It had simulated users exercising the server all day. We measured key performance metrics from both a user perspective and a server perspective, which allowed us to proactively catch memory leaks, inefficient code, and unexpected changes to UX.
I still have to give kudos to Sam and the team. Coming from the land of phpBB, where something like this would take decades to solve and remedy, I found the fast response terrific. (Even if it meant staying up to 2am CT compared to Sydney time.)