Discourse unavailable with high load average


#1

Hello everyone,

I have been running a Discourse instance on DigitalOcean for the last two years without any problem, but for the last two weeks it has been very slow or unavailable most of the time.

I have upgraded my hosting to the following plan; it’s a bit better, but I’m still facing the same problems:

  • Ubuntu 14.04 x64
  • 2 vCPUs
  • 4GB / 80GB Disk

This is the output of the htop command:

Any idea where to start fixing this problem? Thanks a lot for your help.


(Matt Palmer) #2

What’s the %iowait? In general, htop is a poor substitute for top for actual diagnostic purposes.
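
For example, one quick way to read %iowait from the command line (the sar command here assumes the sysstat package is installed):

# One-shot batch run of top; the "wa" figure in the %Cpu(s) line is iowait
top -bn1 | grep '%Cpu'
# Or sample overall CPU usage once per second, five samples
sar -u 1 5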


#3

top - 04:41:15 up 1 day, 25 min,  1 user,  load average: 6.32, 7.70, 6.89
Tasks: 162 total,   4 running, 158 sleeping,   0 stopped,   0 zombie
%Cpu(s): 64.2 us, 19.9 sy,  0.0 ni,  6.9 id,  7.7 wa,  0.0 hi,  1.3 si,  0.0 st
KiB Mem :  4048268 total,   116444 free,  2742964 used,  1188860 buff/cache
KiB Swap:  1048572 total,   451544 free,   597028 used.    68040 avail Mem 

(Matt Palmer) #4

Given the 1m load average is nearly half what it was in the previous screenshot, and %iowait is still pretty high, I’d say you’re probably I/O constrained, likely due to swapping. Confirm with sar -W 1 whilst the system is at peak unhappiness, and then upgrade RAM.
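
If sar isn’t on the box yet, it comes from the sysstat package; a minimal sketch of the check described above:

apt-get install sysstat
# Report pages swapped in/out per second, once a second, until interrupted
sar -W 1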


#5

Thanks Matt,

Here is the output of sar -W 1 (I have no idea what it means :confused:):

04:52:49 AM  pswpin/s pswpout/s
04:52:50 AM     95.00      2.00
04:52:51 AM    166.00      0.00
04:52:52 AM     40.00      0.00
04:52:53 AM     19.00      0.00
04:52:54 AM     74.00      0.00
04:52:55 AM    125.00      0.00
04:52:56 AM    247.00      0.00
04:52:57 AM    215.84      2.97
04:52:58 AM     70.00      0.00
04:52:59 AM    334.00      0.00
04:53:00 AM    390.00      0.00
04:53:01 AM    568.00      6.00
04:53:02 AM    702.00      0.00
04:53:03 AM   1047.52      5.94
04:53:04 AM    416.00      0.00
04:53:05 AM    449.00      0.00
04:53:06 AM    691.00      0.00
04:53:07 AM    772.00      6.00
04:53:08 AM    550.00      0.00
04:53:09 AM    181.00      0.00
04:53:10 AM    476.00      0.00
04:53:11 AM    348.00      0.00
04:53:12 AM    316.00      4.00
04:53:13 AM    454.00      0.00
04:53:14 AM    356.00      0.00
04:53:15 AM    911.88      7.92
04:53:16 AM    262.00      0.00
04:53:17 AM    303.00      0.00
04:53:18 AM    271.00      0.00
04:53:19 AM    284.00      6.00

(Matt Palmer) #6

The headings should give you a hint… “pages swapped in per second” and “pages swapped out per second”. Each page is (usually) 4k, so swapping in 1,000 pages in a second means you’re copying about 4MB of data off disk and into RAM. That’s gonna take a while.
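
As a rough back-of-envelope version of that arithmetic (the 1,000 pages/s figure is just the example rate from above):

# Page size in bytes on this system (typically 4096)
getconf PAGESIZE
# 1,000 pages/s swapped in × 4096-byte pages, expressed in KiB/s
echo $(( 1000 * 4096 / 1024 ))    # 4000 KiB/s, i.e. roughly 4MB read back from swap every second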

You’re definitely due for a RAM upgrade, or for doing something to reduce RAM usage (although your unicorns aren’t exactly sitting around chatting over the sports page, so I wouldn’t recommend that).


(Sam Saffron) #7

Running this many Unicorn workers on two virtual CPUs that already run Redis and Postgres does not make sense.

I would run 3 workers max, and give the leftover RAM to Postgres.
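
As a rough sketch of what that looks like in a standard /var/discourse Docker install (the buffer value below is illustrative, not something prescribed in this thread):

# In containers/app.yml: set UNICORN_WORKERS: 3 under env,
# and raise db_shared_buffers under params (e.g. "1GB") so Postgres gets the spare RAM
cd /var/discourse
./launcher rebuild app    # rebuild so the new settings take effect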


#8

Thanks guys.

What I don’t understand is that I had 2GB of RAM and it was working well, then suddenly I started getting a high load average. I have already upgraded to 4GB of RAM and I am still facing the same issue. How is it possible that Discourse suddenly needs so much more RAM?

I also changed UNICORN_WORKERS to 3 in my app.yml; it was 8 before.

I am rebuilding the app now and I’ll check whether anything changes.


(Sam Saffron) #9

Let us know how you go with the safer number.


(Matt Palmer) #10

You probably started serving a whole lot more traffic.


#11

Well, actually it’s pretty much the opposite :confused: It’s a lot quieter during summer vacation.

So after 2 hours and around 30 users live:

top - 06:11:09 up 1 day,  1:55,  1 user,  load average: 4.09, 3.24, 2.91
Tasks: 130 total,   7 running, 123 sleeping,   0 stopped,   0 zombie
%Cpu(s): 62.5 us, 26.4 sy,  0.0 ni,  1.4 id,  7.4 wa,  0.0 hi,  2.3 si,  0.0 st
KiB Mem :  4048268 total,   135872 free,  1739644 used,  2172752 buff/cache
KiB Swap:  1048572 total,   975084 free,    73488 used.  1763862 avail Mem 

06:12:12 AM  pswpin/s pswpout/s
06:12:13 AM      0.00      0.00
06:12:14 AM      0.00      0.00
06:12:15 AM      0.00      0.00
06:12:16 AM      0.00      0.00
06:12:17 AM      0.00      0.00
06:12:18 AM      0.00      0.00
06:12:19 AM      0.00      0.00
06:12:20 AM      0.00      0.00
06:12:21 AM      0.00      0.00
06:12:22 AM      0.00      0.00
06:12:23 AM      0.00      6.00
06:12:24 AM      0.00      0.00
06:12:25 AM      0.00      0.00
06:12:26 AM      0.00      0.00
06:12:27 AM      0.00      5.00
06:12:28 AM      0.00      0.00
06:12:29 AM      0.00      0.00

It’s still super slow, and I keep getting 502 errors when I try to answer a message. Some messages do get posted, though.


(Sam Saffron) #12

How many page views are you getting? Where is the traffic coming from? Look at your nginx logs; maybe your server is under some sort of attack.
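
For example, something along these lines (the path assumes the standard /var/discourse Docker layout):

# Count requests per client IP in the nginx access log and show the noisiest ones
cd /var/discourse/shared/standalone/log/var-log/nginx
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -20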


#13

I am getting around 20 to 30 page views per minute, with around 30 users online at the same time.

I am checking the nginx logs but I’m not sure how to analyse all that data. I looked at the script from Analyzing Discourse Performance using NGINX logs, but it seems to be deprecated?

I also went through the Post-Install Maintenance section and ran apt-get install fail2ban. I am not sure whether I have to do anything else, but it seems to slightly decrease the load average.
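
For reference, a quick way to sanity-check that fail2ban is actually running and banning (jail names vary by configuration, so treat these as examples):

# List active jails and their status
fail2ban-client status
# Banned IPs show up as fail2ban chains in the firewall rules
iptables -L -n | grep -i fail2ban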


(Jay Pfaffman) #14

I recently had similar problems that went away when I switched to an “optimized” droplet.


#15

Yeah, but it’s 5 times more expensive for the size of disk I need :confused: Unfortunately I cannot afford it.


(Jay Pfaffman) #16

My solution was to move backups and uploads to a DigitalOcean Volume ($10/month for 100GB).


#19

Then shop around :shopping_cart: :slight_smile: Look at Vultr, Scaleway, Hetzner…


#20

I know this is kinda beside the point now, but I want to post in defense of htop: you can enable detailed CPU stats to see the iowait figure, and add the process state and IO read/write bytes columns to get a better picture of which processes eat a lot of IO (linux - htop - show I/O wait percentage - Server Fault).


There’s also a giant list of other optional columns like page faults to play around with in the F2 setup menu :slight_smile:


(Matt Palmer) #21

Let me know when you figure out how to use the F2 setup menu in a screenshot.


#22

Oof, that 30-second length limit on imgur animations is brutal.


(using OBS and keycastr in case that’s interesting)