Higher load and 500 internal server errors since 28th Dec 2017 upgrade


(Chris Croome) #1

I upgraded Discourse at https://community.coops.tech/ on 28th December. Since then there have been a couple of occasions when the site has returned 500 errors and I have had to restart it (via ./launcher restart app), and I have noticed a change in the pattern of some of the Munin graphs generated by the server.

The load has been consistently higher:

The memory usage has been growing (the drops are when I have restarted the app due to 500 errors):

And the memory graph appears to correspond to the threads graph:

This server has 2GB of RAM. I could add some more and reboot it, but before I try that I was wondering if anyone has any suggestions on how to get to the bottom of this issue?

I was going to post which version of Discourse the server is running, but the admin page doesn’t fully load; I just get the navigation menu at the top:


(Chris Croome) #3

@dev11 sorry, but I don’t find your reply very helpful, and I would suggest you migrate to some other forum.

Looking at the Docker logs for the container there are a huge number of lines like this:

2017-12-30 10:10:26.719 UTC [15341] discourse@discourse FATAL:  sorry, too many clients already
2017-12-30 10:10:26.727 UTC [15339] discourse@discourse FATAL:  remaining connection slots are reserved for non-replication superuser connections
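
These errors suggest postgres is hitting its connection limit. For reference, a rough way to compare the current connection count against the configured limit from inside the container (this is my guess at the right incantation) would be something like:

./launcher enter app
sudo -u postgres psql -c 'show max_connections;'
sudo -u postgres psql -c 'select count(*) from pg_stat_activity;'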

(Chris Croome) #4

A few other graphs, processes:

CPU usage:

And open inodes:


(Chris Croome) #5

One further change was made on 28th December: Docker CE was upgraded from 17.09.1-ce (2017-12-07) to 17.12.0-ce, build c97c6d6. The release notes for this version do not yet appear to be available.


(Chris Croome) #6

The open files graph is also noteworthy:


(Gerhard Schlager) #7

It could be a problem with the latest Docker version. We have seen problems with 17.11 too. @mpalmer could probably tell you a lot about that; he recommended downgrading to 17.10, so it could be worth trying in your case too.
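
If you want to try the downgrade, something along these lines should do it on a Debian-based host (the exact version string below is only a placeholder, use whatever apt actually lists):

apt-cache madison docker-ce                    # list the package versions apt knows about
apt-get install docker-ce=17.10.0~ce-0~debian  # pin to the 17.10 build reported by the previous command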


(Chris Croome) #8

The thread you referenced on GitHub contains a post from @mpalmer which appears to indicate that the problem can’t be reproduced with Docker 17.12.0-ce, which is the version I’m running:

docker -v
Docker version 17.12.0-ce, build c97c6d6

I have noticed that there is a large number of postgres processes:

ps aux | grep postgres | wc -l
75

And top indicates that one of the two CPUs is running with 99% iowait:

top - 16:11:21 up 105 days, 16:56,  1 user,  load average: 1.02, 1.02, 1.00
Tasks: 181 total,   1 running, 180 sleeping,   0 stopped,   0 zombie
%Cpu0  :  1.3 us,  0.7 sy,  0.0 ni, 98.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.3 us,  0.7 sy,  0.0 ni,  0.0 id, 99.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  2044304 total,    62452 free,  1466288 used,   515564 buff/cache
KiB Swap:  2097148 total,  2037460 free,    59688 used.   466492 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
17021 discour+  20   0 1696812 301140  25900 S   1.3 14.7   0:41.94 ruby
10815 discour+  20   0  522884 166456   7764 S   0.3  8.1   0:02.49 ruby
12571 discour+  20   0   43272   3356   2696 R   0.3  0.2   0:00.12 top

(Chris Croome) #9

This appears odd to me: lsof /var/lib/docker returns 4894 lines, which I have posted here. Checking on another Discourse server (which doesn’t have a separate /var/lib/docker partition), there are only 2348 open files when I run lsof / | wc -l.
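
To see which commands account for all those open files I could presumably break the lsof output down with something like this (my own rough sketch):

lsof /var/lib/docker | awk '{print $1}' | sort | uniq -c | sort -rn | head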


(Chris Croome) #10

There are even more open files for the /var/discourse partition:

lsof /var/discourse | wc -l
11095

I have posted the output here; most of the processes are postgres:

lsof /var/discourse | grep postgres | wc -l
11132
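
To see how many file descriptors each postgres backend is holding, something like this ought to work (a rough sketch, run as root on the host):

for pid in $(pgrep -f 'postgres:'); do printf '%s ' "$pid"; ls /proc/"$pid"/fd | wc -l; done | sort -k2 -rn | head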

It appears to me that, for an unknown reason, postgres is opening files until the partition runs out of inodes, and then Discourse starts returning internal server errors until the container is restarted. This is the current inode usage on this server:

df -i
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
udev            248128    279  247849    1% /dev
tmpfs           255538    334  255204    1% /run
/dev/xvda2      655360  80447  574913   13% /
tmpfs           255538      1  255537    1% /dev/shm
tmpfs           255538      3  255535    1% /run/lock
tmpfs           255538     15  255523    1% /sys/fs/cgroup
/dev/xvda4     1966080 581268 1384812   30% /var/lib/docker
/dev/xvda3     1310720   4318 1306402    1% /var/discourse
overlay        1966080 581268 1384812   30% /var/lib/docker/overlay2/7bcf2b5b249bfee2779029a08bd5aa382280fa7f215bab5ff92f2262b191ff88/merged
shm             255538      2  255536    1% /var/lib/docker/containers/fa74729955fc712235de719c91122424db26fa473d4ac4b460fc8426059dedac/shm
tmpfs           255538     11  255527    1% /run/user/0

(Chris Croome) #11

Sorry, the above is wrong: the disks are not running out of inodes (/var/lib/docker is only at 29% and this isn’t changing). It appears that the limit is the number of postgres processes; they keep growing until the 500 errors start, at around 300, and the count is then reset by a ./launcher restart app.
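
To keep an eye on this, a crude way to log the process count over time (the interval is arbitrary) would be something like:

while true; do echo "$(date) $(ps ax | grep -c '[p]ostgres:')"; sleep 300; done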


(Matt Palmer) #12

With postgres going gangbusters, it’s probably best to start with a survey of all those processes, to try and figure out what they’re up to. A ps ax | grep postgres: will give you a start, although it will only tell you whether the processes are idle, or doing a SELECT, or UPDATE, or INSERT, or whatever.
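
For example, something like this (just a sketch) gives a rough tally of what the backends claim to be doing, since the query verb or "idle" is the last word of each process title:

ps ax | grep '[p]ostgres:' | awk '{print $NF}' | sort | uniq -c | sort -rn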

Assuming that they’re not all sitting around idle, to get a complete picture of what the queries are doing, you’ll need to enter the container (./launcher enter app), bring up the psql console (sudo -u postgres psql), and get the complete running query data (select * from pg_stat_activity). You’re looking for queries that have been running for aaaaaaaaages (because they could be holding locks that everyone else needs to proceed), and lots of queries doing the exact same thing. I lean towards the former, myself, because of the high iowait.
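
If the full select * is too wide to read comfortably, a narrower query along these lines (again, just a sketch) highlights the long-runners:

select pid, state, now() - query_start as runtime, left(query, 60) as query from pg_stat_activity where state <> 'idle' order by query_start;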

You can see which process is doing all that I/O with iotop, btw (not usually installed by default, but very handy and worth having).
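
On a Debian-ish host that would be something like:

apt-get install iotop
iotop -oPa   # -o only shows processes actually doing I/O, -P shows per-process rather than per-thread, -a shows accumulated totals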


(Chris Croome) #13

It’s just serving 500 errors again now.

ps ax | grep postgres | wc -l
105

These are almost all:

28761 ?        Ss     0:00 postgres: 9.5/main: discourse discourse [local] SELECT

In the app the following doesn’t return anything at all:

sudo -u postgres psql
select * from pg_stat_activity

I’m going to restart the app and can pick this up again tomorrow, thanks for the suggestions.


(Matt Palmer) #14

Gah, sorry, there’s supposed to be a semicolon at the end of that select, as in select * from pg_stat_activity;.


(Chris Croome) #15

Thanks @mpalmer, this is what I get (this is before another ./launcher restart app). I couldn’t post it here as it would make this comment longer than the character limit.

I also started generating another Munin graph last night and this makes it clear that the culprit is Postgres:


(Chris Croome) #16

I rebooted the virtual server (it had been up 100 days and there had been several kernel updates in that time), and now the /admin page loads and reveals that the site is up to date and running v1.9.0.beta17 +38.

After the reboot quite a few emails were sent out (relating to posts from last night and today); none had been sent today before the reboot.


(Chris Croome) #17

So, unless the problem occurs again, it appears that it was perhaps somehow related to the Docker version and the Linux kernel version; after a reboot the issue appears to have been solved: