I recently tried to update Discourse via the Docker Manager plugin’s interface but have run into some problems.
Initially, Discourse was running but there were updates available, and it had been a month or two since I'd last updated. First I updated Docker Manager itself from its interface, which worked fine. Then I tried to update Discourse. I believe the log opened with a message about bundle being out of date, but eventually the entire site became unresponsive to web requests, and the Docker Manager web interface stopped responding too, so I couldn't tell whether something had gone wrong from there. I decided to update the system packages and restart everything to start fresh.
Unfortunately this did not fix the issue: the site became unresponsive again, though not immediately. When it came back up I believe I tried to update some of the other plugins, but after it stopped responding again I decided to just rebuild the app.
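For reference, this is how I've been rebuilding (assuming the standard /var/discourse install location; adjust the path if yours differs):

```shell
# Standard Discourse Docker install location (assumption: default setup)
cd /var/discourse

# Pull the latest discourse_docker scripts before rebuilding
git pull

# Rebuild the container from scratch; the site is down while this runs
./launcher rebuild app
```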
This is about where I am now. I've tried some other things, but they haven't yielded much success. With the latest rebuilt/restarted app, the site works momentarily (topics are accessible, posts can be written, etc.), but after a minute or two it stops responding. Server health looks fine as far as I can tell. Here is a `top` summary during the brief period when the web interface was available.
top - 15:48:40 up 7:36, 1 user, load average: 1.94, 0.80, 0.38
Tasks: 139 total, 3 running, 136 sleeping, 0 stopped, 0 zombie
%Cpu(s): 60.0 us, 12.5 sy, 0.0 ni, 25.0 id, 1.2 wa, 0.0 hi, 1.2 si, 0.0 st
MiB Mem : 2000.180 total, 77.324 free, 1003.438 used, 919.418 buff/cache
MiB Swap: 2047.996 total, 2010.887 free, 37.109 used. 719.566 avail Mem
And here it is after the site has become unresponsive.
top - 15:52:05 up 7:40, 1 user, load average: 0.61, 0.70, 0.42
Tasks: 143 total, 2 running, 141 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.7 us, 0.8 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
MiB Mem : 2000.180 total, 131.539 free, 1021.273 used, 847.367 buff/cache
MiB Swap: 2047.996 total, 2003.777 free, 44.219 used. 660.539 avail Mem
Here's a `ps -aux` dump, where my username is "tmpname":
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 37912 2528 ? Ss 08:11 0:04 /sbin/init
root 2 0.0 0.0 0 0 ? S 08:11 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S 08:11 0:01 [ksoftirqd/0]
root 5 0.0 0.0 0 0 ? S< 08:11 0:00 [kworker/0:0H]
root 7 0.0 0.0 0 0 ? S 08:11 0:14 [rcu_sched]
root 8 0.0 0.0 0 0 ? S 08:11 0:00 [rcu_bh]
root 9 0.0 0.0 0 0 ? S 08:11 0:00 [migration/0]
root 10 0.0 0.0 0 0 ? S 08:11 0:00 [watchdog/0]
root 11 0.0 0.0 0 0 ? S 08:11 0:00 [watchdog/1]
root 12 0.0 0.0 0 0 ? S 08:11 0:00 [migration/1]
root 13 0.0 0.0 0 0 ? S 08:11 0:01 [ksoftirqd/1]
root 15 0.0 0.0 0 0 ? S< 08:11 0:00 [kworker/1:0H]
root 16 0.0 0.0 0 0 ? S 08:11 0:00 [kdevtmpfs]
root 17 0.0 0.0 0 0 ? S< 08:11 0:00 [netns]
root 18 0.0 0.0 0 0 ? S< 08:11 0:00 [perf]
root 19 0.0 0.0 0 0 ? S 08:11 0:00 [khungtaskd]
root 20 0.0 0.0 0 0 ? S< 08:11 0:00 [writeback]
root 21 0.0 0.0 0 0 ? SN 08:11 0:00 [ksmd]
root 22 0.0 0.0 0 0 ? SN 08:11 0:04 [khugepaged]
root 23 0.0 0.0 0 0 ? S< 08:11 0:00 [crypto]
root 24 0.0 0.0 0 0 ? S< 08:11 0:00 [kintegrityd]
root 25 0.0 0.0 0 0 ? S< 08:11 0:00 [bioset]
root 26 0.0 0.0 0 0 ? S< 08:11 0:00 [kblockd]
root 27 0.0 0.0 0 0 ? S< 08:11 0:00 [ata_sff]
root 28 0.0 0.0 0 0 ? S< 08:11 0:00 [md]
root 29 0.0 0.0 0 0 ? S< 08:11 0:00 [devfreq_wq]
root 33 0.1 0.0 0 0 ? S 08:11 0:28 [kswapd0]
root 34 0.0 0.0 0 0 ? S< 08:11 0:00 [vmstat]
root 35 0.0 0.0 0 0 ? S 08:11 0:00 [fsnotify_mark]
root 36 0.0 0.0 0 0 ? S 08:11 0:00 [ecryptfs-kthrea]
root 52 0.0 0.0 0 0 ? S< 08:11 0:00 [kthrotld]
root 53 0.0 0.0 0 0 ? S< 08:11 0:00 [acpi_thermal_pm]
root 54 0.0 0.0 0 0 ? S 08:11 0:00 [vballoon]
root 55 0.0 0.0 0 0 ? S< 08:11 0:00 [bioset]
root 56 0.0 0.0 0 0 ? S< 08:11 0:00 [bioset]
root 57 0.0 0.0 0 0 ? S< 08:11 0:00 [bioset]
root 58 0.0 0.0 0 0 ? S< 08:11 0:00 [bioset]
root 59 0.0 0.0 0 0 ? S< 08:11 0:00 [bioset]
root 60 0.0 0.0 0 0 ? S< 08:11 0:00 [bioset]
root 61 0.0 0.0 0 0 ? S< 08:11 0:00 [bioset]
root 62 0.0 0.0 0 0 ? S< 08:11 0:00 [bioset]
root 63 0.0 0.0 0 0 ? S< 08:11 0:00 [bioset]
root 64 0.0 0.0 0 0 ? S 08:11 0:00 [scsi_eh_0]
root 65 0.0 0.0 0 0 ? S< 08:11 0:00 [scsi_tmf_0]
root 66 0.0 0.0 0 0 ? S 08:11 0:00 [scsi_eh_1]
root 67 0.0 0.0 0 0 ? S< 08:11 0:00 [scsi_tmf_1]
root 73 0.0 0.0 0 0 ? S< 08:11 0:00 [ipv6_addrconf]
root 86 0.0 0.0 0 0 ? S< 08:11 0:00 [deferwq]
root 87 0.0 0.0 0 0 ? S< 08:11 0:00 [charger_manager]
root 124 0.0 0.0 0 0 ? S< 08:11 0:00 [bioset]
root 125 0.0 0.0 0 0 ? S< 08:11 0:00 [bioset]
root 126 0.0 0.0 0 0 ? S< 08:11 0:00 [bioset]
root 127 0.0 0.0 0 0 ? S< 08:11 0:00 [bioset]
root 128 0.0 0.0 0 0 ? S< 08:11 0:00 [bioset]
root 129 0.0 0.0 0 0 ? S< 08:11 0:00 [bioset]
root 130 0.0 0.0 0 0 ? S< 08:11 0:00 [bioset]
root 131 0.0 0.0 0 0 ? S< 08:11 0:00 [bioset]
root 132 0.0 0.0 0 0 ? S 08:11 0:00 [scsi_eh_2]
root 133 0.0 0.0 0 0 ? S< 08:11 0:00 [kpsmoused]
root 134 0.0 0.0 0 0 ? S< 08:11 0:00 [scsi_tmf_2]
root 135 0.0 0.0 0 0 ? S< 08:11 0:00 [ttm_swap]
root 136 0.0 0.0 0 0 ? S< 08:11 0:00 [qxl_gc]
root 413 0.0 0.0 0 0 ? S 08:11 0:02 [jbd2/vda1-8]
root 414 0.0 0.0 0 0 ? S< 08:11 0:00 [ext4-rsv-conver]
root 455 0.0 0.0 0 0 ? S< 08:11 0:00 [kworker/0:1H]
root 456 0.0 0.1 29648 2772 ? Ss 08:11 0:05 /lib/systemd/systemd-journald
root 479 0.0 0.0 0 0 ? S 08:11 0:00 [kauditd]
root 495 0.0 0.0 44432 1412 ? Ss 08:11 0:00 /lib/systemd/systemd-udevd
root 532 0.0 0.0 0 0 ? S< 08:11 0:00 [kworker/1:1H]
systemd+ 627 0.0 0.0 100324 388 ? Ssl 08:11 0:00 /lib/systemd/systemd-timesyncd
root 750 0.0 0.0 0 0 ? S< 08:11 0:00 [kvm-irqfd-clean]
root 991 0.0 0.0 4400 80 ? Ss 08:12 0:00 /usr/sbin/acpid
root 997 0.0 0.0 65520 816 ? Ss 08:12 0:01 /usr/sbin/sshd -D
root 1004 0.0 0.0 29880 220 ? Ss 08:12 0:00 /sbin/cgmanager -m name=systemd
syslog 1008 0.0 0.0 256396 84 ? Ssl 08:12 0:01 /usr/sbin/rsyslogd -n
root 1012 0.0 0.0 275872 1456 ? Ssl 08:12 0:02 /usr/lib/accountsservice/accounts-daemon
root 1015 0.0 0.0 29008 580 ? Ss 08:12 0:00 /usr/sbin/cron -f
message+ 1020 0.0 0.0 42900 856 ? Ss 08:12 0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activati
daemon 1030 0.0 0.0 26044 196 ? Ss 08:12 0:00 /usr/sbin/atd -f
root 1031 0.0 0.0 28620 1784 ? Ss 08:12 0:00 /lib/systemd/systemd-logind
root 1038 0.5 1.3 530876 27212 ? Ssl 08:12 2:24 /usr/bin/dockerd -H fd://
root 1068 0.0 0.0 19472 104 ? Ss 08:12 0:01 /usr/sbin/irqbalance --pid=/var/run/irqbalance.pid
root 1075 0.0 0.0 277088 696 ? Ssl 08:12 0:00 /usr/lib/policykit-1/polkitd --no-debug
root 1091 0.0 0.0 15940 144 tty1 Ss+ 08:12 0:00 /sbin/agetty --noclear tty1 linux
root 1095 0.1 0.6 383184 13932 ? Ssl 08:12 0:52 docker-containerd --config /var/run/docker/containerd/containerd.toml
root 14580 0.0 0.0 0 0 ? S 15:00 0:01 [kworker/u4:1]
root 18499 0.0 0.0 0 0 ? S 15:07 0:00 [kworker/1:3]
root 20792 0.0 0.1 99512 3544 ? Ss 15:35 0:00 sshd: tmpname [priv]
tmpname 20812 0.0 0.1 45248 2196 ? Ss 15:36 0:00 /lib/systemd/systemd --user
root 20814 0.0 0.0 0 0 ? S 15:36 0:00 [kworker/0:1]
tmpname 20816 0.0 0.0 63424 1156 ? S 15:36 0:00 (sd-pam)
root 20818 0.0 0.0 0 0 ? S 15:36 0:00 [kworker/1:1]
tmpname 20924 0.0 0.1 99512 2788 ? S 15:36 0:00 sshd: tmpname@pts/0
tmpname 20925 0.0 0.2 24324 4964 pts/0 Ss 15:36 0:00 -bash
root 20957 0.0 0.1 56836 2212 pts/0 S 15:36 0:00 sudo bash
root 20958 0.0 0.2 24176 5636 pts/0 S 15:36 0:00 bash
root 21690 0.0 0.0 0 0 ? S 15:45 0:00 [kworker/0:2]
root 21797 0.0 0.0 0 0 ? S 15:45 0:00 [kworker/u4:0]
root 23189 0.0 0.0 50872 1548 ? Sl 15:47 0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 443 -container-ip 172.17.0.
root 23201 0.0 0.2 50872 5624 ? Sl 15:47 0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 80 -container-ip 172.17.0.2
root 23207 0.0 0.2 233780 5256 ? Sl 15:47 0:00 docker-containerd-shim --namespace moby --workdir /var/lib/docker/containerd/daemon/io.c
root 23224 0.0 0.0 21288 1220 pts/0 Ss+ 15:47 0:00 /bin/bash /sbin/boot
root 23521 0.0 0.0 4396 940 pts/0 S+ 15:47 0:00 /usr/bin/runsvdir -P /etc/service
root 23522 0.0 0.0 4244 400 ? Ss 15:47 0:00 runsv cron
root 23523 0.0 0.0 4244 404 ? Ss 15:47 0:00 runsv nginx
root 23524 0.0 0.0 4244 452 ? Ss 15:47 0:00 runsv postgres
root 23525 0.0 0.0 4244 416 ? Ss 15:47 0:00 runsv unicorn
root 23526 0.0 0.0 4244 372 ? Ss 15:47 0:00 runsv redis
root 23527 0.0 0.1 82724 3284 ? S 15:47 0:00 nginx: master process /usr/sbin/nginx
root 23528 0.0 0.0 29300 1232 ? S 15:47 0:00 cron -f
root 23529 0.0 0.0 4244 404 ? Ss 15:47 0:00 runsv rsyslog
systemd+ 23530 0.0 0.7 295044 15104 ? S 15:47 0:00 /usr/lib/postgresql/9.5/bin/postmaster -D /etc/postgresql/9.5/main
systemd+ 23531 0.0 0.0 182664 1240 ? Sl 15:47 0:00 rsyslogd -n
_apt 23533 2.1 10.9 261612 224668 ? Sl 15:47 0:06 /usr/bin/redis-server *:6379
www-data 23543 0.1 0.2 84500 5228 ? S 15:47 0:00 nginx: worker process
www-data 23544 0.2 0.2 84444 5064 ? S 15:47 0:00 nginx: worker process
www-data 23545 0.0 0.0 82892 1640 ? S 15:47 0:00 nginx: cache manager process
systemd+ 23549 0.0 0.2 295148 4600 ? Ss 15:47 0:00 postgres: 9.5/main: checkpointer process
systemd+ 23550 0.0 0.6 295044 13636 ? Ss 15:47 0:00 postgres: 9.5/main: writer process
systemd+ 23551 0.0 0.3 295044 6432 ? Ss 15:47 0:00 postgres: 9.5/main: wal writer process
systemd+ 23552 0.0 0.1 295472 3620 ? Ss 15:47 0:00 postgres: 9.5/main: autovacuum launcher process
systemd+ 23553 0.0 0.1 150348 2368 ? Ss 15:47 0:00 postgres: 9.5/main: stats collector process
tmpname 23684 0.2 0.1 29900 4092 ? S 15:47 0:00 /bin/bash config/unicorn_launcher -E production -c config/unicorn.conf.rb
tmpname 23689 6.3 7.9 472664 163784 ? Sl 15:47 0:17 unicorn master -E production -c config/unicorn.conf.rb
tmpname 23780 0.8 8.5 516740 174604 ? Sl 15:48 0:02 sidekiq 5.0.5 discourse [0 of 5 busy]
systemd+ 23787 11.8 7.3 315760 149940 ? Ss 15:48 0:29 postgres: 9.5/main: discourse discourse [local] UPDATE
systemd+ 24034 0.2 2.6 309200 54172 ? Ss 15:48 0:00 postgres: 9.5/main: discourse discourse [local] SELECT
systemd+ 24068 0.8 7.2 313480 149440 ? Ss 15:48 0:02 postgres: 9.5/main: discourse discourse [local] idle
systemd+ 24295 0.4 7.2 309404 149224 ? Ss 15:49 0:00 postgres: 9.5/main: discourse discourse [local] SELECT
systemd+ 24402 0.0 2.3 308764 48828 ? Ss 15:49 0:00 postgres: 9.5/main: discourse discourse [local] SELECT
systemd+ 24693 0.6 6.7 307392 139276 ? Ss 15:50 0:00 postgres: 9.5/main: discourse discourse [local] SELECT
systemd+ 24710 1.2 7.8 325556 160096 ? Ss 15:50 0:01 postgres: 9.5/main: discourse discourse [local] SELECT
systemd+ 24904 0.0 1.0 302932 20832 ? Ss 15:51 0:00 postgres: 9.5/main: discourse discourse [local] UPDATE waiting
systemd+ 25045 0.0 1.1 303176 23092 ? Ss 15:51 0:00 postgres: 9.5/main: discourse discourse [local] UPDATE waiting
systemd+ 25210 0.7 3.6 305964 74032 ? Ss 15:51 0:00 postgres: 9.5/main: discourse discourse [local] SELECT
systemd+ 25213 0.0 0.9 303100 19800 ? Ss 15:51 0:00 postgres: 9.5/main: discourse discourse [local] idle
tmpname 25218 22.5 14.6 1688000 300012 ? Sl 15:51 0:07 unicorn worker[0] -E production -c config/unicorn.conf.rb
systemd+ 25345 0.4 2.0 306012 41032 ? Ss 15:51 0:00 postgres: 9.5/main: discourse discourse [local] UPDATE waiting
tmpname 25378 101 14.4 1679772 295916 ? Sl 15:52 0:06 unicorn worker[1] -E production -c config/unicorn.conf.rb
root 25394 0.0 0.0 0 0 ? S 15:52 0:00 [kworker/u4:2]
tmpname 25494 0.0 0.1 20148 3480 ? S 15:52 0:00 sleep 1
root 25495 0.0 0.1 37364 3216 pts/0 R+ 15:52 0:00 ps -aux
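One thing that stands out to me in that dump is the several postgres backends stuck in "UPDATE waiting", which makes me suspect lock contention (perhaps something still holding a lock from the interrupted upgrade). I was thinking of checking that from inside the container with something like the following. This is just a sketch: it assumes the standard /var/discourse install, and PostgreSQL 9.5, which still has the boolean `waiting` column in `pg_stat_activity` (it was replaced by `wait_event` in 9.6).

```shell
# Enter the running app container (standard /var/discourse install assumed)
cd /var/discourse && ./launcher enter app

# Inside the container, list blocked and active backends as the postgres user.
su postgres -c "psql discourse -c \
  \"SELECT pid, state, waiting, now() - query_start AS runtime, query
    FROM pg_stat_activity
    WHERE waiting OR state <> 'idle'
    ORDER BY query_start;\""
```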
I've tried rebuilding with a couple of different configurations. One of the first things I tried was disabling all plugins in containers/app.yml. I tried this with the latest Discourse version, as well as in combination with pinning an older Discourse version in the same file. Unfortunately, while I think both had some effect in prolonging the life of the web interface to a half-hour or an hour, neither solved the problem. I also noticed that with one version of Discourse from October, the interface would stay up but topics were inaccessible, so I'm not sure how reliable these observations are. Interface lifetime with the default configuration is somewhat variable anyway, but tends to be 1 to 5 minutes.
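For reference, this is roughly how I disabled the plugins: commenting out their clone lines under the `after_code` hook in containers/app.yml. The third-party plugin name here is illustrative, not my actual list; docker_manager itself has to stay since the standard install depends on it.

```yaml
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          # docker_manager is required by the standard install, so it stays
          - git clone https://github.com/discourse/docker_manager.git
          # Third-party plugins commented out for the rebuild (name illustrative)
          #- git clone https://github.com/discourse/discourse-solved.git
```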
The Discourse admin logs in the web interface don't show any fatal errors during the time the site is up, but I'd love to know how to check them from within the container proper, if that would help.
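My best guess for checking the logs outside the web interface is something like the following, assuming the standard install paths; corrections welcome:

```shell
# From the host: the rails logs are bind-mounted under shared/ (standard install)
tail -n 200 /var/discourse/shared/standalone/log/rails/production.log

# Or enter the container and read them directly
cd /var/discourse && ./launcher enter app
tail -n 200 /var/www/discourse/log/production.log

# unicorn's stderr can also be informative if workers are dying
tail -n 200 /var/www/discourse/log/unicorn.stderr.log
```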
Does anyone have any idea what may be happening, or what else I could throw at the wall to try to solve it? If more info would be useful, I'd be happy to provide it.