Discourse stops responding after the nightly backup has run


(Chris Croome) #1

A large Discourse server I have been managing has not been in great shape since upgrading it to v2.0.0.beta1 +18. Every few days I have to ./launcher restart app because users are getting HTTP 500 errors, so I was thinking that downgrading it to the stable tag, v1.9.0, might solve this. In addition, I’d rather the site be on stable versions than beta versions, as less frequent updates would mean less work.

I’m afraid I haven’t found any documentation or a thread on this site that explains how to do this; sorry if I’m missing something obvious.


(Sam Saffron) #2

Sorry, but downgrading 2.0 to 1.9 stable is not supported and never will be.

You are going to need to post more information about why stuff is crashing and get it sorted.


(Jeff Atwood) #3

Any weird third party plugins? That would be my first port of call.


(Matt Palmer) #4

Check your logs. That’ll tell you what the error is, and then it can be fixed.


(Chris Croome) #5

OK, thanks. I did wonder if I was asking the impossible. When a stable 2.x version is out I’ll look at switching to that; in the meantime I’m happy to post whatever information helps get to the bottom of this.

Looking at the logs via docker logs $(docker ps -q) | less, these lines since I last restarted the app look like they might indicate some issues:

fail: nginx: runsv not running
[Thu 11 Jan 07:52:45 UTC 2018] Reload error for :
Started runsvdir, PID is 248
sh: echo: I/O error
rsyslogd: command 'KLogPermitNonKernelFacility' is currently not permitted - did you already set it via a RainerScript command (v6+ config)? [v8.16.0 try http://www.rsyslog.com/e/2222 ]


260:M 11 Jan 07:52:45.815 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
260:M 11 Jan 07:52:45.816 # Server started, Redis version 3.0.6
260:M 11 Jan 07:52:45.816 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
260:M 11 Jan 07:52:45.816 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
rsyslogd: imklog: cannot open kernel log (/proc/kmsg): Operation not permitted.
rsyslogd: activation of module imklog failed [v8.16.0 try http://www.rsyslog.com/e/2145 ]
rsyslogd: Could not open output pipe '/dev/xconsole':: No such file or directory [v8.16.0 try http://www.rsyslog.com/e/2039 ]


config/unicorn_launcher: line 47:  6636 Segmentation fault      (core dumped) sleep 1
config/unicorn_launcher: line 47:  6637 Segmentation fault      (core dumped) sleep 1
config/unicorn_launcher: line 47:  6638 Segmentation fault      (core dumped) sleep 1
config/unicorn_launcher: line 47:  6639 Segmentation fault      (core dumped) sleep 1
config/unicorn_launcher: line 47:  6640 Segmentation fault      (core dumped) sleep 1
config/unicorn_launcher: line 47:  6641 Segmentation fault      (core dumped) sleep 1
config/unicorn_launcher: fork: Cannot allocate memory
config/unicorn_launcher: line 9: wait_for: No record of process 6641
exiting


config/unicorn_launcher: line 47:  6852 Segmentation fault      (core dumped) sleep 1
config/unicorn_launcher: line 47:  6853 Segmentation fault      (core dumped) sleep 1
config/unicorn_launcher: line 47:  6854 Segmentation fault      (core dumped) sleep 1
config/unicorn_launcher: line 47:  6855 Segmentation fault      (core dumped) sleep 1
config/unicorn_launcher: line 47:  6857 Segmentation fault      (core dumped) sleep 1
config/unicorn_launcher: line 47:  6861 Segmentation fault      (core dumped) sleep 1
config/unicorn_launcher: line 47:  6862 Segmentation fault      (core dumped) sleep 1
config/unicorn_launcher: fork: Cannot allocate memory
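As an aside, those Redis warnings about somaxconn, overcommit_memory and Transparent Huge Pages point at host-level settings I should probably fix anyway. Applying the fixes Redis itself suggests would look roughly like this on the Docker host (a sketch; vm.overcommit_memory and the THP setting need adding to /etc/sysctl.conf and /etc/rc.local respectively to survive a reboot, and 1024 for somaxconn is just a common choice, not a requirement):

# allow memory overcommit so background saves can fork
sysctl -w vm.overcommit_memory=1
# raise the socket listen backlog above Redis's 511 default
sysctl -w net.core.somaxconn=1024
# disable Transparent Huge Pages, as the Redis warning recommends
echo never > /sys/kernel/mm/transparent_hugepage/enabled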

The last line in the log before the restart was from 2018-01-11 03:36:54 and related to pg_dump. I wonder if the problem could be that the server doesn’t have enough resources to run the nightly backup?

The backups do look OK, although I haven’t tried restoring one; they are 1 GB in size. These Munin graphs from the server show some patterns: the last line in the log file before I restarted the container corresponds with a significant drop in disk IOPS.

One of the two CPUs is fully utilized:

In terms of memory, things look OK to me; the server has 4GB of RAM and no swap:

Where else can I look for clues for the cause of this problem?


(Matt Palmer) #6

Yeeeeeeah… that’s not going to be doing you any good.


(Sam Saffron) #7

We don’t support downgrades because of migrations; once migrations run, we do not guarantee they can run backwards.

Regarding your issues, do you have swap configured? Can you configure a 2G-4G swap for extra breathing room?


(Chris Croome) #8

Thanks a lot, I’ll add a swap disk; the VPS has SSDs and no swap at the moment.
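For reference, what I have in mind is the usual swap-file approach, following Sam’s 2G suggestion (a sketch; the size and path are my choices):

# create and enable a 2G swap file
fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# make it persistent across reboots
echo '/swapfile none swap sw 0 0' >> /etc/fstab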


(Chris Croome) #9

The only plugin installed is this one:


(Chris Croome) #10

The server is using up to 1.35G of swap, so I think we are going to increase the RAM from 4GB to 6GB to see if that helps.
(Munin graph: memory-day)


(Chris Croome) #11

Adding swap has resulted in the server staying up; postgres memory usage peaks at almost 6GB of RAM:

(Munin graph: multips_memory-day)

But there is still this issue with CPU usage. It appears to start at the same time as the backup (just after 1am) and results in one CPU (the server has two) being fully used until a ./launcher restart app:

(Munin graphs: load-day and cpu-day)

It is clear from running top in the container that the cause is a ruby process:

top - 12:34:37 up 7 days, 23:20,  0 users,  load average: 1.17, 1.38, 1.51
Tasks:  38 total,   1 running,  37 sleeping,   0 stopped,   0 zombie
%Cpu0  :  8.4 us,  1.0 sy,  0.0 ni, 88.3 id,  2.3 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 97.0 us,  0.3 sy,  0.0 ni,  2.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  4050324 total,   332792 free,  1311252 used,  2406280 buff/cache
KiB Swap:  4194300 total,  3746352 free,   447948 used.  1439200 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
  371 discour+  20   0 2084316 275476  13232 S  99.3  6.8 693:51.12 ruby        
 2096 postgres  20   0 1218460 996268 993172 S   4.7 24.6   0:18.66 postmaster  
 2260 discour+  20   0 1666948 261492  24440 S   1.0  6.5   0:06.93 ruby        
  256 discour+  20   0   29896   1720   1552 S   0.3  0.0   0:53.44 unicorn_la+ 
  258 redis     20   0  146924  27584   1652 S   0.3  0.7   5:05.11 redis-serv+ 
    1 root      20   0   21288     52     52 S   0.0  0.0   0:00.04 boot        
  249 root      20   0    4396     20      0 S   0.0  0.0   0:01.22 runsvdir    
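For reference, that top output was captured from inside the container; a minimal sketch of getting there (assuming the standard /var/discourse install directory):

cd /var/discourse
./launcher enter app
top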

But looking at the logs, docker logs $(docker ps -q) | less, it appears that the backup is starting at 3:36am:

2018-01-13 03:41:04.824 UTC [15499] discourse@discourse LOG:  duration: 1197.431 ms  statement: UPDATE topic_search_data
                                           SET
                                              raw_data = 'Backup completed successfully The backup was successful. Visit the admin > backup section to download your new backup. Here’s the log: [2018-01-13 03:36:16] [STARTED] [2018-01-13 03:36:16] ''system'' has started',
                                              locale = 'en',
                                              search_data = TO_TSVECTOR('english', 'Backup completed successfully The backup was successful. Visit the admin > backup section to download your new backup. Here’s the log: [2018-01-13 03:36:16] [STARTED] [2018-01-13 03:36:16] ''system'' has started'),
                                              version = 1
                                           WHERE topic_id = 150720
2018-01-13 03:41:05.018 UTC [15499] discourse@discourse LOG:  duration: 193.542 ms  statement: INSERT INTO topic_search_data
                            (topic_id, search_data, locale, raw_data, version)
                            VALUES (150720, TO_TSVECTOR('english', 'Backup completed successfully The backup was successful. Visit the admin > backup section to download your new backup. Here’s the log: [2018-01-13 03:36:16] [STARTED] [2018-01-13 03:36:16] ''system'' has started'), 'en', 'Backup completed successfully The backup was successful. Visit the admin > backup section to download your new backup. Here’s the log: [2018-01-13 03:36:16] [STARTED] [2018-01-13 03:36:16] ''system'' has started', 1)

Any suggestions regarding what I can do to get to the bottom of this? I’m tempted to set a crontab to ./launcher restart app at 5am every day, but ideally a better solution would be found :slight_smile:
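If I did go down that route, the crontab entry would be something like this (a sketch, assuming the standard /var/discourse install path):

# restart the Discourse container every day at 5am
0 5 * * * cd /var/discourse && ./launcher restart app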


(Sam Saffron) #12

Need more details about which process is taking a high amount of CPU. Try ps aux when it happens, or htop.


(misterdude) #13

I have been seeing this same thing. I restored from a backup and am running 1.9.

The server would run fine for about a day and then memory and swap would slowly increase until Discourse eventually crashed. A simple restart fixed it until the issue crept back.


(Chris Croome) #14

The htop output (sorry I couldn’t copy it as plain text):

Trying to get some more information regarding process 15650 using strace:

date ; strace -p 15650 ; date
Sun 14 Jan 13:19:14 GMT 2018
strace: Process 15650 attached
select(22, [21], NULL, NULL, NULL
^Cstrace: Process 15650 detached
 <detached ...>
Sun 14 Jan 13:20:00 GMT 2018

That process doesn’t appear to be doing much considering the resources it is using…

And process 15672:

clear ; date ; strace -p 15672 ; date
Sun 14 Jan 13:21:07 GMT 2018
strace: Process 15672 attached
strace: [ Process PID=15672 runs in x32 mode. ]
strace: [ Process PID=15672 runs in 64 bit mode. ]
futex(0x7f1304c48644, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f1304c48640, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f1304c48678, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f1304c4867c, FUTEX_WAIT_PRIVATE, 658187, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f1304c48640, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
futex(0x7f1304c4867c, FUTEX_WAIT_PRIVATE, 658189, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661370, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661373, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f1304c48640, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
futex(0x7f1304c4867c, FUTEX_WAIT_PRIVATE, 658191, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661380, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f1304c48640, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
futex(0x7f1304c4867c, FUTEX_WAIT_PRIVATE, 658193, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661386, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661389, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f1304c48640, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
futex(0x7f1304c4867c, FUTEX_WAIT_PRIVATE, 658195, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661394, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661397, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661401, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661403, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661405, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f1304c48640, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
futex(0x7f1304c4867c, FUTEX_WAIT_PRIVATE, 658197, NULL) = 0
write(6, "!", 1)                        = 1
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661409, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f1304c48640, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
futex(0x7f1304c4867c, FUTEX_WAIT_PRIVATE, 658199, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661414, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661417, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661419, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f1304c48640, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
futex(0x7f1304c4867c, FUTEX_WAIT_PRIVATE, 658201, NULL) = 0
write(6, "!", 1)                        = 1
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661423, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661425, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f1304c48640, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
futex(0x7f1304c4867c, FUTEX_WAIT_PRIVATE, 658203, NULL) = 0
write(6, "!", 1)                        = 1
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661429, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f1304c48640, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
futex(0x7f1304c4867c, FUTEX_WAIT_PRIVATE, 658205, NULL) = 0
write(6, "!", 1)                        = 1
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661433, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f1304c48640, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f1304c48678, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f1304c4867c, FUTEX_WAIT_PRIVATE, 658207, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
sched_yield()                           = 0
sched_yield()                           = 0
futex(0x7f1304c48644, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f1304c48640, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
futex(0x7f1304c4867c, FUTEX_WAIT_PRIVATE, 658209, NULL) = 0
write(6, "!", 1)                        = 1
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661439, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f1304c48640, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
futex(0x7f1304c4867c, FUTEX_WAIT_PRIVATE, 658211, NULL) = 0
write(6, "!", 1)                        = 1
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661443, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f1304c48640, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
futex(0x7f1304c4867c, FUTEX_WAIT_PRIVATE, 658213, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661448, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661451, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f1304c48644, FUTEX_WAIT_PRIVATE, 2661455, NULL) = 0
futex(0x7f1304c48610, FUTEX_WAKE_PRIVATE, 1) = 0
^Cstrace: Process 15672 detached
Sun 14 Jan 13:21:08 GMT 2018

As you can see, the above is the output from just one second.

The output of ps -aux:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  57192  2964 ?        Ss   Jan05   0:10 /sbin/init
root         2  0.0  0.0      0     0 ?        S    Jan05   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    Jan05   0:09 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S<   Jan05   0:00 [kworker/0:0H]
root         7  0.0  0.0      0     0 ?        S    Jan05  10:26 [rcu_sched]
root         8  0.0  0.0      0     0 ?        S    Jan05   0:00 [rcu_bh]
root         9  0.0  0.0      0     0 ?        S    Jan05   0:09 [migration/0]
root        10  0.0  0.0      0     0 ?        S<   Jan05   0:00 [lru-add-drain]
root        11  0.0  0.0      0     0 ?        S    Jan05   0:02 [watchdog/0]
root        12  0.0  0.0      0     0 ?        S    Jan05   0:00 [cpuhp/0]
root        13  0.0  0.0      0     0 ?        S    Jan05   0:00 [cpuhp/1]
root        14  0.0  0.0      0     0 ?        S    Jan05   0:02 [watchdog/1]
root        15  0.0  0.0      0     0 ?        S    Jan05   0:08 [migration/1]
root        16  0.0  0.0      0     0 ?        S    Jan05   0:18 [ksoftirqd/1]
root        18  0.0  0.0      0     0 ?        S<   Jan05   0:00 [kworker/1:0H]
root        19  0.0  0.0      0     0 ?        S    Jan05   0:00 [kdevtmpfs]
root        20  0.0  0.0      0     0 ?        S<   Jan05   0:00 [netns]
root        21  0.0  0.0      0     0 ?        S    Jan05   0:00 [khungtaskd]
root        22  0.0  0.0      0     0 ?        S    Jan05   0:00 [oom_reaper]
root        23  0.0  0.0      0     0 ?        S<   Jan05   0:00 [writeback]
root        24  0.0  0.0      0     0 ?        S    Jan05   0:00 [kcompactd0]
root        26  0.0  0.0      0     0 ?        SN   Jan05   0:00 [ksmd]
root        28  0.0  0.0      0     0 ?        S<   Jan05   0:00 [crypto]
root        29  0.0  0.0      0     0 ?        S<   Jan05   0:00 [kintegrityd]
root        30  0.0  0.0      0     0 ?        S<   Jan05   0:00 [bioset]
root        31  0.0  0.0      0     0 ?        S<   Jan05   0:00 [kblockd]
root        32  0.0  0.0      0     0 ?        S<   Jan05   0:00 [devfreq_wq]
root        33  0.0  0.0      0     0 ?        S<   Jan05   0:00 [watchdogd]
root        35  0.2  0.0      0     0 ?        S    Jan05  33:05 [kswapd0]
root        36  0.0  0.0      0     0 ?        S<   Jan05   0:00 [vmstat]
root        48  0.0  0.0      0     0 ?        S<   Jan05   0:00 [kthrotld]
root        49  0.0  0.0      0     0 ?        S<   Jan05   0:00 [ipv6_addrconf]
root        84  0.0  0.0      0     0 ?        S<   Jan05   0:00 [ata_sff]
root        85  0.0  0.0      0     0 ?        S    Jan05   0:00 [scsi_eh_0]
root        87  0.0  0.0      0     0 ?        S<   Jan05   0:00 [scsi_tmf_0]
root        88  0.0  0.0      0     0 ?        S    Jan05   0:00 [scsi_eh_1]
root        89  0.0  0.0      0     0 ?        S<   Jan05   0:00 [scsi_tmf_1]
root        92  0.0  0.0      0     0 ?        S<   Jan05   0:00 [bioset]
root        93  0.0  0.0      0     0 ?        S<   Jan05   0:00 [bioset]
root        96  0.0  0.0      0     0 ?        S<   Jan05   0:13 [kworker/0:1H]
root       102  0.0  0.0      0     0 ?        S<   Jan05   0:00 [md]
root       128  0.0  0.0      0     0 ?        S<   Jan05   0:00 [raid5wq]
root       190  0.0  0.0      0     0 ?        S<   Jan05   0:08 [kworker/1:1H]
root       443  0.0  0.0      0     0 ?        S    Jan05   0:23 [jbd2/vda2-8]
root       444  0.0  0.0      0     0 ?        S<   Jan05   0:00 [ext4-rsv-conver]
root       484  0.0  0.0  59356  1832 ?        Ss   Jan05   1:16 /lib/systemd/systemd-journald
root       494  0.0  0.0      0     0 ?        S    Jan05   0:00 [kauditd]
root       502  0.0  0.0 107200    92 ?        Ss   Jan05   0:00 /sbin/lvmetad -f
root       505  0.0  0.0      0     0 ?        S<   Jan05   0:00 [bond0]
root       512  0.0  0.0  45924   748 ?        Ss   Jan05   0:01 /lib/systemd/systemd-udevd
root       549  0.0  0.0      0     0 ?        S<   Jan05   0:00 [ttm_swap]
root       578  0.0  0.0      0     0 ?        S    Jan05   0:00 [jbd2/vda1-8]
root       579  0.0  0.0      0     0 ?        S<   Jan05   0:00 [ext4-rsv-conver]
root       612  0.0  0.0  49872    76 ?        Ss   Jan05   0:01 /sbin/rpcbind -f -w
root       622  0.0  0.0 250116   584 ?        Ssl  Jan05   0:21 /usr/sbin/rsyslogd -n
root       623  0.0  0.0   4216     0 ?        Ss   Jan05   0:00 /usr/sbin/acpid
message+   624  0.0  0.0  45132   764 ?        Ss   Jan05   0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root       649  0.0  0.0  46500  2032 ?        Ss   Jan05   0:01 /lib/systemd/systemd-logind
root       651  0.0  0.0  35808   100 ?        Ss   Jan05   0:24 /usr/sbin/irqbalance --foreground
root       654  0.0  0.0  29664   360 ?        Ss   Jan05   0:03 /usr/sbin/cron -f
root       871  0.3  0.3 486620 15572 ?        Ssl  Jan05  39:08 /usr/bin/dockerd -H fd://
nagios     877  0.1  0.0 1126232 2304 ?        Ssl  Jan05  15:24 /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2 --no-stack-rlimit daemon -e /var/log/icinga2/error.log
root       878  0.0  0.0  14536    32 tty1     Ss+  Jan05   0:00 /sbin/agetty --noclear tty1 linux
root       879  0.0  0.0  14312    40 ttyS0    Ss+  Jan05   0:00 /sbin/agetty --keep-baud 115200,38400,9600 ttyS0 vt220
_chrony    887  0.0  0.0 107924   696 ?        S    Jan05   0:01 /usr/sbin/chronyd
root       907  0.0  0.0  69944   496 ?        Ss   Jan05   0:16 /usr/sbin/sshd -D
nagios     923  0.0  0.0 511748  1016 ?        S    Jan05   2:06 /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2 --no-stack-rlimit daemon -e /var/log/icinga2/error.log
root      1120  0.1  0.1 319304  4772 ?        Ssl  Jan05  19:09 docker-containerd --config /var/run/docker/containerd/containerd.toml
root      1166  0.0  0.2 294532  8912 ?        Sl   Jan05   3:19 /usr/bin/python3 /usr/bin/fail2ban-server -s /var/run/fail2ban/fail2ban.sock -p /var/run/fail2ban/fail2ban.pid -x -b
root      1230  0.0  0.0  81164   336 ?        Ss   Jan05   0:09 /usr/lib/postfix/sbin/master -w
postfix   1233  0.0  0.0  83392   720 ?        S    Jan05   0:02 qmgr -l -t unix -u
root      2474  0.0  0.0 317084   484 ?        Ssl  Jan05   0:04 /usr/lib/packagekit/packagekitd
root      2479  0.0  0.0 286204   352 ?        Ssl  Jan05   0:00 /usr/lib/policykit-1/polkitd --no-debug
postfix   3255  0.0  0.0  87740   840 ?        S    Jan05   0:01 tlsmgr -l -t unix -u -c
root      5895  0.0  0.0  55992  2480 ?        Ss   Jan12   0:11 /usr/bin/perl -wT /usr/sbin/munin-node
root      8192  0.0  0.0  50120   828 ?        S    01:58   0:00 ssh backup.agile.coop borg serve --umask=077
root      8197  0.0  0.0 235120  1724 ?        Sl   01:58   0:00 /usr/local/bin/borg mount backup.agile.coop:backup::2018-01-14 /media/backup
ntp       8212  0.0  1.0 1223580 41764 ?       Ss   01:58   0:00 postgres: 9.5/main: discourse discourse [local] idle
discour+  8247  0.3  4.8 1742228 194432 ?      Sl   Jan13   2:41 unicorn worker[1] -E production -c config/unicorn.conf.rb
discour+  8266  0.3  4.5 1720200 184212 ?      Sl   Jan13   2:48 unicorn worker[0] -E production -c config/unicorn.conf.rb
postfix  11160  0.0  0.0  83228   556 ?        S    12:11   0:00 pickup -l -t unix -u -c
root     14082  0.0  0.0      0     0 ?        S    Jan13   0:25 [kworker/1:0]
root     14635  0.0  0.0      0     0 ?        S    06:29   0:13 [kworker/1:2]
root     15205  0.0  0.0  50788   428 ?        Sl   Jan13   0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 443 -container-ip 172.17.0.2 -container-port 443
root     15217  0.0  0.0 181860   536 ?        Sl   Jan13   0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 80 -container-ip 172.17.0.2 -container-port 80
root     15223  0.0  0.0   8568    52 ?        Sl   Jan13   0:05 docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/f05e2f390ad1a175a1593438e87e6aac78a4d0cfb9838454372592741ce69da0 -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-runc
root     15237  0.0  0.0  21288    44 pts/0    Ss+  Jan13   0:00 /bin/bash /sbin/boot
root     15528  0.0  0.0   4396    20 pts/0    S+   Jan13   0:00 /usr/bin/runsvdir -P /etc/service
root     15529  0.0  0.0   4244    24 ?        Ss   Jan13   0:00 runsv rsyslog
root     15530  0.0  0.0   4244    24 ?        Ss   Jan13   0:00 runsv cron
root     15531  0.0  0.0   4244    24 ?        Ss   Jan13   0:00 runsv redis
root     15532  0.0  0.0   4244     4 ?        Ss   Jan13   0:00 runsv unicorn
root     15533  0.0  0.0   4244    12 ?        Ss   Jan13   0:00 runsv nginx
root     15534  0.0  0.0   4244    24 ?        Ss   Jan13   0:00 runsv postgres
munin    15535  0.3  0.6 151020 26700 ?        Sl   Jan13   3:04 /usr/bin/redis-server *:6379
ntp      15536  0.0  0.6 1208964 24344 ?       S    Jan13   0:03 /usr/lib/postgresql/9.5/bin/postmaster -D /etc/postgresql/9.5/main
root     15537  0.0  0.0  82728   332 ?        S    Jan13   0:00 nginx: master process /usr/sbin/nginx
discour+ 15538  0.0  0.0  29896  1808 ?        S    Jan13   0:32 /bin/bash config/unicorn_launcher -E production -c config/unicorn.conf.rb
sshd     15539  0.0  0.0 182664   628 ?        Sl   Jan13   0:00 rsyslogd -n
root     15541  0.0  0.0  29300   968 ?        S    Jan13   0:00 cron -f
www-data 15551  0.0  0.0  84540  3512 ?        S    Jan13   0:36 nginx: worker process
www-data 15552  0.0  0.1  84344  4268 ?        S    Jan13   0:27 nginx: worker process
www-data 15553  0.0  0.0  82868   932 ?        S    Jan13   0:00 nginx: cache manager process
discour+ 15556  0.1  2.3 480236 95252 ?        Sl   Jan13   1:26 unicorn master -E production -c config/unicorn.conf.rb
ntp      15559  0.0 14.7 1209096 598104 ?      Ss   Jan13   0:08 postgres: 9.5/main: checkpointer process   
ntp      15560  0.0  3.0 1208964 122404 ?      Ss   Jan13   0:02 postgres: 9.5/main: writer process   
ntp      15561  0.0  0.4 1208964 17244 ?       Ss   Jan13   0:14 postgres: 9.5/main: wal writer process   
ntp      15562  0.0  0.0 1209392 3372 ?        Ss   Jan13   0:00 postgres: 9.5/main: autovacuum launcher process   
ntp      15563  0.0  0.0 150348  1824 ?        Ss   Jan13   0:06 postgres: 9.5/main: stats collector process   
discour+ 15650 67.4  9.8 2069468 397084 ?      Sl   Jan13 685:22 sidekiq 5.0.5 discourse [3 of 5 busy]
discour+ 15760  0.4  4.7 1736116 191912 ?      Sl   Jan13   4:10 unicorn worker[2] -E production -c config/unicorn.conf.rb
discour+ 15818  0.4  4.6 1717168 188396 ?      Sl   Jan13   4:14 unicorn worker[3] -E production -c config/unicorn.conf.rb
root     18917  0.0  0.0  56436  1136 ?        Ss   Jan08   0:00 /lib/systemd/systemd --user
root     18918  0.0  0.0  84688    88 ?        S    Jan08   0:00 (sd-pam)
root     20427  0.0  0.0      0     0 ?        S    12:47   0:00 [kworker/u4:0]
root     23219  0.0  0.0      0     0 ?        S    12:58   0:00 [kworker/0:0]
root     25512  0.0  0.0  27984     0 ?        Ss   Jan08   3:07 SCREEN -S docker-logs
root     25513  0.0  0.0  19948    24 pts/2    Ss   Jan08   0:00 /bin/bash
root     25525  0.0  0.0  55024     0 pts/2    S    Jan08   0:00 su - discourse
discour+ 25526  0.0  0.0  21332    44 pts/2    S+   Jan08   0:00 -su
root     28036  0.0  0.1  99340  5260 ?        Ss   13:14   0:00 sshd: root@pts/0
root     28046  0.0  0.1  21324  4724 pts/0    Ss   13:14   0:00 -bash
root     28301  0.0  0.0      0     0 ?        S    13:15   0:00 [kworker/u4:1]
ntp      29308 10.0 26.2 1241432 1061812 ?     Ss   13:18   0:38 postgres: 9.5/main: discourse discourse [local] idle
root     29429  0.0  0.0      0     0 ?        S    13:19   0:00 [kworker/0:1]
ntp      30334  0.0  1.8 1218904 75112 ?       Ss   13:21   0:00 postgres: 9.5/main: discourse discourse [local] idle
ntp      30335  0.0  1.4 1218432 57732 ?       Ss   13:21   0:00 postgres: 9.5/main: discourse discourse [local] idle
ntp      30382  0.0  1.7 1218820 69828 ?       Ss   13:22   0:00 postgres: 9.5/main: discourse discourse [local] idle
ntp      30427  0.1  2.0 1220224 82288 ?       Ss   13:22   0:00 postgres: 9.5/main: discourse discourse [local] idle
ntp      30446  0.1  1.9 1219008 78452 ?       Ss   13:22   0:00 postgres: 9.5/main: discourse discourse [local] idle
ntp      30447  0.0  1.6 1218984 68036 ?       Ss   13:22   0:00 postgres: 9.5/main: discourse discourse [local] idle
ntp      30458  0.4 15.7 1220004 638376 ?      Ss   13:22   0:00 postgres: 9.5/main: discourse discourse [local] idle
ntp      30466  0.1  1.2 1218172 52436 ?       Ss   13:22   0:00 postgres: 9.5/main: discourse discourse [local] idle
ntp      30473  0.1  1.2 1218172 49020 ?       Ss   13:22   0:00 postgres: 9.5/main: discourse discourse [local] idle
ntp      30546  0.0  1.7 1219068 72532 ?       Ss   13:23   0:00 postgres: 9.5/main: discourse discourse [local] idle
root     30757  0.0  0.0      0     0 ?        S    13:24   0:00 [kworker/u4:2]
postfix  30867  0.0  0.1  83228  6312 ?        S    13:24   0:00 showq -t unix -u -c
root     30894  0.0  0.0      0     0 ?        S    13:24   0:00 [kworker/0:2]
discour+ 30903  0.0  0.0  20148  1396 ?        S    13:24   0:00 sleep 1
root     30906  0.0  0.0  38304  3156 pts/0    R+   13:24   0:00 ps -aux
root     30907  0.0  0.0  21324  2012 pts/0    R+   13:24   0:00 -bash

(Sam Saffron) #15

The issue there is that Sidekiq, our job processor, is using high CPU.

Can you help determine whether a scheduled job is causing it?

Head to sitename.com/sidekiq/scheduler. Is there a job that is not finishing, or is taking really long? You can trigger it to see if it causes the issue.

Rebuild with RBTRACE=1 in env.
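That means adding it under env in containers/app.yml, e.g. (a sketch, assuming the standard install layout):

env:
  RBTRACE: 1

and then running ./launcher rebuild app.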

When Sidekiq is eating 100% CPU, run:

./launcher enter app
gem install rbtrace
rbtrace -p SIDEKIQ_PID --firehose

Where is it stuck?


(Chris Croome) #16

Thanks @sam, I’ll follow this up when it starts doing it again (I restarted the container first thing this morning, before I read your comment above).


(Chris Croome) #17

Thanks, I didn’t know about the web interface to Sidekiq at /sidekiq/. The jobs that didn’t have finished dates were Jobs::EnsureDbConsistency (last run 8 hours ago) and Jobs::CategoryStats (last run 22 hours ago), so I suspect that this is caused by Jobs::CategoryStats; this is a big site with over a million posts imported from a vBulletin site. I should know for sure tomorrow morning.
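In the meantime I may try triggering the suspect job by hand to see if it reproduces the CPU spin; a sketch (assuming the job’s execute method ignores its arguments, which I haven’t verified):

./launcher enter app
rails c
# then, in the Rails console:
Jobs::CategoryStats.new.execute(nil)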


(Chris Croome) #18

The only job that looks like it could be the issue was an email to root generated by a backup crontab. I have added a MAILTO="" env var to the crontab, restarted the Docker container, and will see how things look tomorrow morning.
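For reference, the change is just at the top of the crontab (the backup script line below is a placeholder, not the real job):

MAILTO=""
30 1 * * * /usr/local/bin/nightly-backup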


(Chris Croome) #19

So, I’m 99% sure that this issue was the same as the ones on the Is it safe to roll Docker from 18.01 back to 17.10? thread and the Discourse crashed after backup - http 500 thread. The server was updated to Docker version 17.12.0-ce, build c97c6d6 on 2017-12-28, and I believe both the Discourse backup and the non-Discourse backup job on the server were triggering the Docker issue. Now that both have been addressed (I needed to ensure email from cron was sent to root@localhost or an external address rather than simply root, as email to root was being passed to Discourse), we have had a night without any issues, so this has been resolved as far as I’m concerned. Thanks for all the help in trying to find the cause :slight_smile:
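For anyone hitting the same thing, the mail side of the fix amounts to making sure mail for root resolves to a real mailbox rather than being relayed into Discourse; with Postfix that can be done via /etc/aliases (a sketch, the address is a placeholder):

echo 'root: sysadmin@example.org' >> /etc/aliases
newaliases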


(Chris Croome) #20

I’m afraid that, although the issue with excessive CPU usage has been resolved, the server still stopped responding at around 4am this morning, as indicated in these graphs:

(Munin graphs: interrupts-day, entropy-day, docker_cpu-day, cpu-day, load-day, diskstats_throughput-day, diskstats_iops-day, multips_memory-day)

Looking at the way in which the Postgres memory usage started to flatline, I’m guessing this could be an indication of the cause of the problem.

The last log entries before the restart are from the backup at 3:42:

2018-01-18 03:42:15.031 UTC [7555] discourse@discourse LOG:  duration: 108.316 ms  statement: UPDATE post_search_data
                                           SET
                                              raw_data = 'The backup was successful. Visit the admin > backup section to download your new backup. Here’s the log: [2018-01-18 03:38:04] [STARTED] [2018-01-18 03:38:04] ''system'' has started the backup!
... (too many lines to post here)

My guess is that this indicates that the backup job is causing Postgres to stop responding.

This is a site with a million posts imported from vBulletin and it has 4GB of RAM. Is the problem that it is under-resourced?