Discourse unavailable with high load average

Hello everyone,

I have been running a Discourse instance on Digital Ocean for the last 2 years without any problems, but for the last 2 weeks it has been very slow or unavailable most of the time.

I have upgraded my hosting to the following plan. It’s a bit better, but I am still facing the same problems:

  • Ubuntu 14.04 x64
  • 2 vCPUs
  • 4GB RAM / 80GB Disk

This is the output of the htop command:

Any idea where to start fixing this problem? Thanks a lot for your help.

What’s the %iowait? In general, htop is a poor substitute for top for actual diagnostic purposes.

top - 04:41:15 up 1 day, 25 min,  1 user,  load average: 6.32, 7.70, 6.89
Tasks: 162 total,   4 running, 158 sleeping,   0 stopped,   0 zombie
%Cpu(s): 64.2 us, 19.9 sy,  0.0 ni,  6.9 id,  7.7 wa,  0.0 hi,  1.3 si,  0.0 st
KiB Mem :  4048268 total,   116444 free,  2742964 used,  1188860 buff/cache
KiB Swap:  1048572 total,   451544 free,   597028 used.    68040 avail Mem 

Given the 1m load average is nearly half what it was in the previous screenshot, and %iowait is still pretty high, I’d say you’re probably I/O constrained, likely due to swapping. Confirm with sar -W 1 whilst the system is at peak unhappiness, and then upgrade RAM.

Thanks Matt,

Here is the output of sar -W 1 (I have no idea what it means :confused:)

04:52:49 AM  pswpin/s pswpout/s
04:52:50 AM     95.00      2.00
04:52:51 AM    166.00      0.00
04:52:52 AM     40.00      0.00
04:52:53 AM     19.00      0.00
04:52:54 AM     74.00      0.00
04:52:55 AM    125.00      0.00
04:52:56 AM    247.00      0.00
04:52:57 AM    215.84      2.97
04:52:58 AM     70.00      0.00
04:52:59 AM    334.00      0.00
04:53:00 AM    390.00      0.00
04:53:01 AM    568.00      6.00
04:53:02 AM    702.00      0.00
04:53:03 AM   1047.52      5.94
04:53:04 AM    416.00      0.00
04:53:05 AM    449.00      0.00
04:53:06 AM    691.00      0.00
04:53:07 AM    772.00      6.00
04:53:08 AM    550.00      0.00
04:53:09 AM    181.00      0.00
04:53:10 AM    476.00      0.00
04:53:11 AM    348.00      0.00
04:53:12 AM    316.00      4.00
04:53:13 AM    454.00      0.00
04:53:14 AM    356.00      0.00
04:53:15 AM    911.88      7.92
04:53:16 AM    262.00      0.00
04:53:17 AM    303.00      0.00
04:53:18 AM    271.00      0.00
04:53:19 AM    284.00      6.00

The headings should give you a hint… “pages swapped in per second” and “pages swapped out per second”. Each page is (usually) 4k, so swapping in 1,000 pages in a second means you’re copying about 4MB of data off disk and into RAM. That’s gonna take a while.
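To make that arithmetic concrete, here is a minimal Python sketch that converts the pswpin/s column into MiB per second. It assumes 4 KiB pages and the 12-hour timestamp layout shown in the output above; the function name `swap_in_mib_per_s` is made up for illustration.

```python
PAGE_SIZE_BYTES = 4096  # assumption: 4 KiB pages, the usual size on x86-64


def swap_in_mib_per_s(sar_lines, page_size=PAGE_SIZE_BYTES):
    """Convert the pswpin/s column of `sar -W` output to MiB/s, one value per sample."""
    rates = []
    for line in sar_lines:
        fields = line.split()
        # Data rows look like "04:53:03 AM 1047.52 5.94" (time, AM/PM, in, out).
        # Skip the header row and anything that does not have four fields.
        if len(fields) != 4 or fields[2] == "pswpin/s":
            continue
        pages_in = float(fields[2])
        rates.append(pages_in * page_size / (1024 * 1024))
    return rates


# Two samples copied from the output earlier in the thread.
sample = [
    "04:52:49 AM  pswpin/s pswpout/s",
    "04:53:02 AM    702.00      0.00",
    "04:53:03 AM   1047.52      5.94",
]

for rate in swap_in_mib_per_s(sample):
    print(f"{rate:.2f} MiB/s swapped in")
```

If sar prints 24-hour timestamps on your system, the rows have three fields instead of four and the column index needs adjusting.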

You’re definitely due for a RAM upgrade, or for doing something to reduce RAM usage (although your unicorns aren’t exactly sitting around chatting over the sports page, so I wouldn’t recommend that).

Running this many unicorn workers on two virtual CPUs that already run Redis and PostgreSQL does not make sense.

I would run 3 workers max, and give the leftover RAM to Postgres.

Thanks guys.

What I don’t understand is that I had 2GB RAM and it was working well, then suddenly I started to have a high load average. I already upgraded to 4GB RAM and I am still facing the same issue. How is it possible that Discourse suddenly needs so much more RAM?

I also changed my app.yml to UNICORN_WORKERS: 3; it was 8 before.

I am rebuilding the app now and I’ll check if there are any changes.

Let us know how you go with the safer number.

You probably started serving a whole lot more traffic.

Well, actually it’s pretty much the opposite :confused: It’s a lot quieter during summer vacation.

So after 2 hours and around 30 users live:

top - 06:11:09 up 1 day,  1:55,  1 user,  load average: 4.09, 3.24, 2.91
Tasks: 130 total,   7 running, 123 sleeping,   0 stopped,   0 zombie
%Cpu(s): 62.5 us, 26.4 sy,  0.0 ni,  1.4 id,  7.4 wa,  0.0 hi,  2.3 si,  0.0 st
KiB Mem :  4048268 total,   135872 free,  1739644 used,  2172752 buff/cache
KiB Swap:  1048572 total,   975084 free,    73488 used.  1763862 avail Mem 

06:12:12 AM  pswpin/s pswpout/s
06:12:13 AM      0.00      0.00
06:12:14 AM      0.00      0.00
06:12:15 AM      0.00      0.00
06:12:16 AM      0.00      0.00
06:12:17 AM      0.00      0.00
06:12:18 AM      0.00      0.00
06:12:19 AM      0.00      0.00
06:12:20 AM      0.00      0.00
06:12:21 AM      0.00      0.00
06:12:22 AM      0.00      0.00
06:12:23 AM      0.00      6.00
06:12:24 AM      0.00      0.00
06:12:25 AM      0.00      0.00
06:12:26 AM      0.00      0.00
06:12:27 AM      0.00      5.00
06:12:28 AM      0.00      0.00
06:12:29 AM      0.00      0.00

It’s still super slow, and I keep getting 502 errors on my side when I try to reply to a message. Some messages can be posted though.

How many page views are you getting? Where is the traffic coming from? Look at your nginx logs; maybe your server is under some sort of attack.

I am getting around 20 to 30 page views/minute, with around 30 users at the same time.

I am checking the nginx log but I’m not sure how to analyse all of it. I looked at that script but it seems to be deprecated? Analyzing Discourse Performance using NGINX logs

I also checked the Post-Install Maintenance section and ran apt-get install fail2ban. I am not sure if I have to do anything else, but it seems to have decreased the load average slightly.
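For a first look at the nginx access log without the deprecated script, a rough Python sketch like the one below can show whether a handful of clients account for most of the requests. It assumes the client address is the first IPv4-looking token on each line (true for common nginx log formats, but check yours); `top_clients` is a hypothetical helper name, and the sample lines use made-up documentation addresses rather than real log data.

```python
import re
from collections import Counter

# Assumption: the first dotted-quad token on a line is the client address.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def top_clients(lines, n=10):
    """Return the n most frequent client addresses as (address, hits) pairs."""
    counts = Counter()
    for line in lines:
        match = IPV4.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts.most_common(n)


# Made-up sample lines; real logs carry more fields than this.
sample_log = [
    '203.0.113.7 "GET /latest HTTP/1.1" 200',
    '203.0.113.7 "GET /t/some-topic/42 HTTP/1.1" 200',
    '198.51.100.23 "POST /posts HTTP/1.1" 502',
    '203.0.113.7 "GET /latest HTTP/1.1" 200',
]

for address, hits in top_clients(sample_log, n=5):
    print(f"{hits:6d}  {address}")
```

Any iterable of lines works, so an open file object for the real access log can be passed instead of `sample_log`. One address far ahead of the rest would be worth a closer look; an even spread points away from a single abusive client.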

I recently had similar problems that went away when I switched to an “optimized” droplet.

Yeah but it’s 5 times more expensive for the size of the disk I need :confused: I unfortunately cannot afford it :confused:

My solution was to move backups and uploads to Digital Ocean Volumes ($10/month for 100GB).

Then shop around :shopping_cart: :slight_smile: . Look at Vultr, Scaleway, Hetzner …

I know this is kinda beside the point now, but I want to post in defense of htop: you can enable advanced CPU stats to get the iowait numbers, and add columns for process state and IO read and write bytes to get a better picture of which processes eat a lot of IO ( linux - htop - show I/O wait percentage - Server Fault )

There’s also a giant list of other optional columns like page faults to play around with in the F2 setup menu :slight_smile:


Let me know when you figure out how to use the F2 setup menu in a screenshot.

Oof, that 30 second length limit on imgur animations is brutal

(using OBS and keycastr in case that’s interesting)