Trying to troubleshoot I/O Wait bottleneck

torsi · October 30, 2020, 2:51am

Cliff notes version:

System brought to its knees for a few minutes by iowait under moderate load
No significant disk I/O is happening at the time
DO swears there is no unusual activity on the hypervisor

Long version:
Currently running on an 8 core/16 GB droplet on DigitalOcean, recently upgraded from the 4 core/8 GB size in anticipation of short-term heavier traffic.

At the time the system had db_shared_buffers at 2 GB and unicorn workers at 8. I have since increased both.

Heavy posting to a single topic during a ~2 hour live event, approximately 200 total users in that time period, averaging around 70 at any moment according to Analytics.

System was stable for over an hour. Load averages stayed around 4, CPU at about 30%, memory around 35%. No significant disk activity, no problems with the forum.

All at once the number of users per Analytics doubled. I do not know if these were real users or some artifact of a different problem. We are not a large or notable community. Nobody links to us, and it would be very unusual to suddenly gain a bunch of new users.

At the same time, iowait went very red with sustained 50%+ warnings. The 1 min load average spiked to 12 and the 5 min average exceeded 8. The entire system became very unresponsive, the forum slowed to a crawl, and lots of users were temporarily shown the logged-out view because of high load. CPU never exceeded 40%.
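For anyone who wants to cross-check iowait figures like these independently of a dashboard, here is a minimal sketch in Python. It assumes a Linux host with a readable /proc/stat; the function names (`parse_cpu_line`, `iowait_percent`, `watch`) are made up for illustration and are not part of Discourse or any monitoring tool.

```python
import time


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    # Field order: user nice system idle iowait irq softirq steal [guest ...].
    # Guest time is already included in user/nice, so only the first 8 count.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def watch(interval_s=5, samples=12, path="/proc/stat"):
    """Print one iowait reading per interval (hypothetical helper)."""
    with open(path) as f:
        prev = parse_cpu_line(f.read())
    for _ in range(samples):
        time.sleep(interval_s)
        with open(path) as f:
            cur = parse_cpu_line(f.read())
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp} iowait {iowait_percent(prev, cur):.1f}%")
        prev = cur
```

Calling `watch()` on the droplet during an event and redirecting the output to a file would give a timestamped record to line up against the load averages.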

This lasted for 3-4 minutes, then everything tapered back down to normal over the next 10. Disk I/O stayed under 1 MB/s the whole time according to DO graphs, mostly writes.
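A throughput graph in MB/s may not tell the whole story, since a disk can be kept busy by many small operations or by a few slow ones. The sketch below, again in Python and only an illustration, derives operations per second, average time per operation, and device busy percentage from two /proc/diskstats samples. The device name `vda` is an assumption (check `lsblk` for the real one), and the helper names are hypothetical.

```python
def parse_diskstats(text, device):
    """Pick the cumulative counters for one device out of /proc/diskstats text."""
    for line in text.splitlines():
        f = line.split()
        # Columns: major minor name reads merged sectors ms writes merged
        # sectors ms in_progress busy_ms weighted_ms [...]
        if len(f) >= 14 and f[2] == device:
            return {
                "reads": int(f[3]),
                "read_ms": int(f[6]),
                "writes": int(f[7]),
                "write_ms": int(f[10]),
                "busy_ms": int(f[12]),
            }
    raise ValueError(f"device {device!r} not found")


def disk_summary(before, after, interval_s):
    """Compare two samples taken interval_s seconds apart."""
    ops = (after["reads"] - before["reads"]) + (after["writes"] - before["writes"])
    wait_ms = (after["read_ms"] - before["read_ms"]) + (
        after["write_ms"] - before["write_ms"]
    )
    busy_ms = after["busy_ms"] - before["busy_ms"]
    return {
        "iops": ops / interval_s,
        # Average milliseconds each completed operation spent queued or in service.
        "await_ms": wait_ms / ops if ops else 0.0,
        # Share of the interval during which the device had I/O in flight.
        "util_pct": 100.0 * busy_ms / (interval_s * 1000.0),
    }
```

To use it, read /proc/diskstats twice a few seconds apart, pass each text through `parse_diskstats(text, "vda")`, and hand both results to `disk_summary`. A high `await_ms` or `util_pct` alongside low MB/s would point at latency rather than volume.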

Is it possible for this to be a forum/configuration issue? Or is DigitalOcean lying and it’s really a hardware problem? If it’s the former, any suggestions for preventing it from happening again? (This was not the first time.)

I have another, larger, one-off heavy traffic event coming up shortly which will be very important to the community. I don’t have the luxury of trial and error to see what works. I only have one shot to get it right (or wrong).

ljpp · October 30, 2020, 5:55am

I am not sure if this is related to your issue, but recently there has been a lot of discussion about live events and the activity spikes they cause.
