Trying to troubleshoot an iowait bottleneck

Cliff notes version:

  • System brought to its knees for a few minutes by iowait under moderate load
  • No significant disk I/O is happening at the time
  • DigitalOcean (DO) swears there is no unusual activity on the hypervisor

Long version:
I'm currently running an 8-core/16 GB droplet on DigitalOcean, recently upgraded from a 4-core/8 GB one in anticipation of short-term heavier traffic.

At the time of the incident, db_shared_buffers was set to 2 GB and the number of unicorn workers to 8. I have since increased both.

The load was heavy posting to a single topic during a ~2-hour live event: approximately 200 users in total over that period, averaging around 70 at any moment according to Analytics.

The system was stable for over an hour. Load averages stayed around 4, CPU at about 30%, and memory at around 35%. There was no significant disk activity and no problems with the forum.

All at once the number of users per Analytics doubled. I do not know if these were real users or some artifact of a different problem. We are not a large or notable community. Nobody links to us, and it would be very unusual to suddenly gain a bunch of new users.

At the same time, iowait shot up, with sustained warnings of 50% or more. The 1-minute load average spiked to 12 and the 5-minute average exceeded 8. The entire system became very unresponsive, the forum slowed to a crawl, and there were lots of temporary high-load logouts. CPU never exceeded 40%.
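To have my own record of iowait during the next event, independent of the DO graphs, something like the following could be left running. This is only a minimal sketch, assuming a Linux host where /proc/stat is readable; the function names, the 5-second interval, and the sample count are illustrative choices, not part of Discourse or any DigitalOcean tooling.

```python
import time

# Position of the iowait counter among the numeric fields of the aggregate
# "cpu" line in /proc/stat: user, nice, system, idle, iowait, irq, softirq, steal
IOWAIT_INDEX = 4


def parse_cpu_line(stat_text):
    """Return the counters of the aggregate 'cpu' line as a list of ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent."""
    # Only the first eight counters are summed; guest time is already
    # included in user/nice, so adding it again would double count.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[IOWAIT_INDEX] / total


def watch(samples=12, interval=5.0, path="/proc/stat"):
    """Print a timestamped iowait percentage every `interval` seconds."""
    def read():
        with open(path) as f:
            return parse_cpu_line(f.read())

    prev = read()
    for _ in range(samples):
        time.sleep(interval)
        cur = read()
        print(time.strftime("%H:%M:%S"), f"iowait {iowait_percent(prev, cur):.1f}%")
        prev = cur
```

Calling watch() with the defaults prints one line every 5 seconds for a minute. Redirecting that output to a file would give timestamps that can be lined up against the forum logs afterwards.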

This lasted for 3-4 minutes, then everything tapered back down to normal over the next 10 minutes. Disk I/O stayed under 1 MB/s the whole time according to the DO graphs, mostly writes.
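A throughput graph in MB/s does not show how many requests were issued or how long each one took, so a burst of small writes can look quiet there. Below is a minimal sketch for deriving write IOPS, average write latency, and device utilization from two /proc/diskstats samples. The device name vda and the function names are assumptions for illustration, and the numbers in the usage note are made up to show the arithmetic, not measurements from my droplet.

```python
SECTOR_BYTES = 512  # /proc/diskstats always counts sectors in 512-byte units


def parse_diskstats(text, device="vda"):
    """Extract the write-related counters for one device from /proc/diskstats text.

    The default device name 'vda' is an assumption; check lsblk for the real one.
    """
    for line in text.splitlines():
        f = line.split()
        if len(f) >= 14 and f[2] == device:
            return {
                "writes": int(f[7]),            # writes completed
                "sectors_written": int(f[9]),   # sectors written
                "ms_writing": int(f[10]),       # total ms spent on writes
                "ms_busy": int(f[12]),          # ms the device had I/O in flight
            }
    raise ValueError(f"device {device!r} not found")


def write_profile(before, after, seconds):
    """Summarize write activity between two samples taken `seconds` apart."""
    writes = after["writes"] - before["writes"]
    written = (after["sectors_written"] - before["sectors_written"]) * SECTOR_BYTES
    ms_writing = after["ms_writing"] - before["ms_writing"]
    ms_busy = after["ms_busy"] - before["ms_busy"]
    return {
        "write_iops": writes / seconds,
        "write_mb_per_s": written / seconds / 1e6,
        "avg_write_latency_ms": ms_writing / writes if writes else 0.0,
        "avg_write_size_kb": written / writes / 1024 if writes else 0.0,
        "util_percent": 100.0 * ms_busy / (seconds * 1000),
    }
```

As a made-up illustration: 2,000 writes of 4 KB each over 10 seconds is only about 0.8 MB/s, yet if those writes took 40 ms each and the device was busy for 9 of the 10 seconds, the profile would report 200 IOPS, 40 ms average latency, and 90% utilization. A reading like that during a spike would point at storage latency rather than bandwidth.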

Is it possible for this to be a forum/configuration issue? Is DigitalOcean lying and it’s really a hardware problem? If it’s the former, any suggestions for preventing it from happening again (this was not the first time)?

I have another, larger, one-off heavy traffic event coming up shortly which will be very important to the community. I don’t have the luxury of trial and error to see what works. I only have one shot to get it right (or wrong).


I am not sure if this is related to your issue, but recently there has been a lot of discussion about live events and the activity spikes they cause.