Discourse web interface becomes unresponsive a few minutes after starting

I recently tried to update Discourse via the Docker Manager plugin’s interface but have run into some problems.

Initially, Discourse was running but updates were available, and it had been a month or two since I’d last updated. First I updated Docker Manager from its interface, which worked fine. Then I tried to update Discourse. I believe the log opened with a message about the bundle being out of date, but eventually the entire site became unresponsive to web requests, and the Docker Manager web interface stopped responding too, so I couldn’t tell whether something had gone wrong from there. I decided to just update the system and restart everything to start fresh.

Unfortunately this did not fix the issue, as the site became unresponsive again. It wasn’t immediate though. When it came back up I believe I tried to update some of the other plugins, but after it stopped responding again I decided to just rebuild the app.

This is about where I am now. I’ve tried some other things, but they haven’t yielded much success. With the latest rebuilt/restarted app, the site works momentarily (topics are accessible, posts can be written, etc.), but after a minute or two it stops responding. Server health looks fine as far as I can tell. Here is a top summary taken during the brief window when the web interface is available.

top - 15:48:40 up  7:36,  1 user,  load average: 1.94, 0.80, 0.38
Tasks: 139 total,   3 running, 136 sleeping,   0 stopped,   0 zombie
%Cpu(s): 60.0 us, 12.5 sy,  0.0 ni, 25.0 id,  1.2 wa,  0.0 hi,  1.2 si,  0.0 st
MiB Mem : 2000.180 total,   77.324 free, 1003.438 used,  919.418 buff/cache
MiB Swap: 2047.996 total, 2010.887 free,   37.109 used.  719.566 avail Mem

And here it is after the site has become unresponsive.

top - 15:52:05 up  7:40,  1 user,  load average: 0.61, 0.70, 0.42
Tasks: 143 total,   2 running, 141 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.7 us,  0.8 sy,  0.0 ni, 98.3 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem : 2000.180 total,  131.539 free, 1021.273 used,  847.367 buff/cache
MiB Swap: 2047.996 total, 2003.777 free,   44.219 used.  660.539 avail Mem

Here’s a ps -aux dump where my username is “tmpname”:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  37912  2528 ?        Ss   08:11   0:04 /sbin/init
root         2  0.0  0.0      0     0 ?        S    08:11   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    08:11   0:01 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S<   08:11   0:00 [kworker/0:0H]
root         7  0.0  0.0      0     0 ?        S    08:11   0:14 [rcu_sched]
root         8  0.0  0.0      0     0 ?        S    08:11   0:00 [rcu_bh]
root         9  0.0  0.0      0     0 ?        S    08:11   0:00 [migration/0]
root        10  0.0  0.0      0     0 ?        S    08:11   0:00 [watchdog/0]
root        11  0.0  0.0      0     0 ?        S    08:11   0:00 [watchdog/1]
root        12  0.0  0.0      0     0 ?        S    08:11   0:00 [migration/1]
root        13  0.0  0.0      0     0 ?        S    08:11   0:01 [ksoftirqd/1]
root        15  0.0  0.0      0     0 ?        S<   08:11   0:00 [kworker/1:0H]
root        16  0.0  0.0      0     0 ?        S    08:11   0:00 [kdevtmpfs]
root        17  0.0  0.0      0     0 ?        S<   08:11   0:00 [netns]
root        18  0.0  0.0      0     0 ?        S<   08:11   0:00 [perf]
root        19  0.0  0.0      0     0 ?        S    08:11   0:00 [khungtaskd]
root        20  0.0  0.0      0     0 ?        S<   08:11   0:00 [writeback]
root        21  0.0  0.0      0     0 ?        SN   08:11   0:00 [ksmd]
root        22  0.0  0.0      0     0 ?        SN   08:11   0:04 [khugepaged]
root        23  0.0  0.0      0     0 ?        S<   08:11   0:00 [crypto]
root        24  0.0  0.0      0     0 ?        S<   08:11   0:00 [kintegrityd]
root        25  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        26  0.0  0.0      0     0 ?        S<   08:11   0:00 [kblockd]
root        27  0.0  0.0      0     0 ?        S<   08:11   0:00 [ata_sff]
root        28  0.0  0.0      0     0 ?        S<   08:11   0:00 [md]
root        29  0.0  0.0      0     0 ?        S<   08:11   0:00 [devfreq_wq]
root        33  0.1  0.0      0     0 ?        S    08:11   0:28 [kswapd0]
root        34  0.0  0.0      0     0 ?        S<   08:11   0:00 [vmstat]
root        35  0.0  0.0      0     0 ?        S    08:11   0:00 [fsnotify_mark]
root        36  0.0  0.0      0     0 ?        S    08:11   0:00 [ecryptfs-kthrea]
root        52  0.0  0.0      0     0 ?        S<   08:11   0:00 [kthrotld]
root        53  0.0  0.0      0     0 ?        S<   08:11   0:00 [acpi_thermal_pm]
root        54  0.0  0.0      0     0 ?        S    08:11   0:00 [vballoon]
root        55  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        56  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        57  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        58  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        59  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        60  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        61  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        62  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        63  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        64  0.0  0.0      0     0 ?        S    08:11   0:00 [scsi_eh_0]
root        65  0.0  0.0      0     0 ?        S<   08:11   0:00 [scsi_tmf_0]
root        66  0.0  0.0      0     0 ?        S    08:11   0:00 [scsi_eh_1]
root        67  0.0  0.0      0     0 ?        S<   08:11   0:00 [scsi_tmf_1]
root        73  0.0  0.0      0     0 ?        S<   08:11   0:00 [ipv6_addrconf]
root        86  0.0  0.0      0     0 ?        S<   08:11   0:00 [deferwq]
root        87  0.0  0.0      0     0 ?        S<   08:11   0:00 [charger_manager]
root       124  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       125  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       126  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       127  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       128  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       129  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       130  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       131  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       132  0.0  0.0      0     0 ?        S    08:11   0:00 [scsi_eh_2]
root       133  0.0  0.0      0     0 ?        S<   08:11   0:00 [kpsmoused]
root       134  0.0  0.0      0     0 ?        S<   08:11   0:00 [scsi_tmf_2]
root       135  0.0  0.0      0     0 ?        S<   08:11   0:00 [ttm_swap]
root       136  0.0  0.0      0     0 ?        S<   08:11   0:00 [qxl_gc]
root       413  0.0  0.0      0     0 ?        S    08:11   0:02 [jbd2/vda1-8]
root       414  0.0  0.0      0     0 ?        S<   08:11   0:00 [ext4-rsv-conver]
root       455  0.0  0.0      0     0 ?        S<   08:11   0:00 [kworker/0:1H]
root       456  0.0  0.1  29648  2772 ?        Ss   08:11   0:05 /lib/systemd/systemd-journald
root       479  0.0  0.0      0     0 ?        S    08:11   0:00 [kauditd]
root       495  0.0  0.0  44432  1412 ?        Ss   08:11   0:00 /lib/systemd/systemd-udevd
root       532  0.0  0.0      0     0 ?        S<   08:11   0:00 [kworker/1:1H]
systemd+   627  0.0  0.0 100324   388 ?        Ssl  08:11   0:00 /lib/systemd/systemd-timesyncd
root       750  0.0  0.0      0     0 ?        S<   08:11   0:00 [kvm-irqfd-clean]
root       991  0.0  0.0   4400    80 ?        Ss   08:12   0:00 /usr/sbin/acpid
root       997  0.0  0.0  65520   816 ?        Ss   08:12   0:01 /usr/sbin/sshd -D
root      1004  0.0  0.0  29880   220 ?        Ss   08:12   0:00 /sbin/cgmanager -m name=systemd
syslog    1008  0.0  0.0 256396    84 ?        Ssl  08:12   0:01 /usr/sbin/rsyslogd -n
root      1012  0.0  0.0 275872  1456 ?        Ssl  08:12   0:02 /usr/lib/accountsservice/accounts-daemon
root      1015  0.0  0.0  29008   580 ?        Ss   08:12   0:00 /usr/sbin/cron -f
message+  1020  0.0  0.0  42900   856 ?        Ss   08:12   0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activati
daemon    1030  0.0  0.0  26044   196 ?        Ss   08:12   0:00 /usr/sbin/atd -f
root      1031  0.0  0.0  28620  1784 ?        Ss   08:12   0:00 /lib/systemd/systemd-logind
root      1038  0.5  1.3 530876 27212 ?        Ssl  08:12   2:24 /usr/bin/dockerd -H fd://
root      1068  0.0  0.0  19472   104 ?        Ss   08:12   0:01 /usr/sbin/irqbalance --pid=/var/run/irqbalance.pid
root      1075  0.0  0.0 277088   696 ?        Ssl  08:12   0:00 /usr/lib/policykit-1/polkitd --no-debug
root      1091  0.0  0.0  15940   144 tty1     Ss+  08:12   0:00 /sbin/agetty --noclear tty1 linux
root      1095  0.1  0.6 383184 13932 ?        Ssl  08:12   0:52 docker-containerd --config /var/run/docker/containerd/containerd.toml
root     14580  0.0  0.0      0     0 ?        S    15:00   0:01 [kworker/u4:1]
root     18499  0.0  0.0      0     0 ?        S    15:07   0:00 [kworker/1:3]
root     20792  0.0  0.1  99512  3544 ?        Ss   15:35   0:00 sshd: tmpname [priv]
tmpname  20812  0.0  0.1  45248  2196 ?        Ss   15:36   0:00 /lib/systemd/systemd --user
root     20814  0.0  0.0      0     0 ?        S    15:36   0:00 [kworker/0:1]
tmpname  20816  0.0  0.0  63424  1156 ?        S    15:36   0:00 (sd-pam)
root     20818  0.0  0.0      0     0 ?        S    15:36   0:00 [kworker/1:1]
tmpname  20924  0.0  0.1  99512  2788 ?        S    15:36   0:00 sshd: tmpname@pts/0
tmpname  20925  0.0  0.2  24324  4964 pts/0    Ss   15:36   0:00 -bash
root     20957  0.0  0.1  56836  2212 pts/0    S    15:36   0:00 sudo bash
root     20958  0.0  0.2  24176  5636 pts/0    S    15:36   0:00 bash
root     21690  0.0  0.0      0     0 ?        S    15:45   0:00 [kworker/0:2]
root     21797  0.0  0.0      0     0 ?        S    15:45   0:00 [kworker/u4:0]
root     23189  0.0  0.0  50872  1548 ?        Sl   15:47   0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 443 -container-ip 172.17.0.
root     23201  0.0  0.2  50872  5624 ?        Sl   15:47   0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 80 -container-ip 172.17.0.2
root     23207  0.0  0.2 233780  5256 ?        Sl   15:47   0:00 docker-containerd-shim --namespace moby --workdir /var/lib/docker/containerd/daemon/io.c
root     23224  0.0  0.0  21288  1220 pts/0    Ss+  15:47   0:00 /bin/bash /sbin/boot
root     23521  0.0  0.0   4396   940 pts/0    S+   15:47   0:00 /usr/bin/runsvdir -P /etc/service
root     23522  0.0  0.0   4244   400 ?        Ss   15:47   0:00 runsv cron
root     23523  0.0  0.0   4244   404 ?        Ss   15:47   0:00 runsv nginx
root     23524  0.0  0.0   4244   452 ?        Ss   15:47   0:00 runsv postgres
root     23525  0.0  0.0   4244   416 ?        Ss   15:47   0:00 runsv unicorn
root     23526  0.0  0.0   4244   372 ?        Ss   15:47   0:00 runsv redis
root     23527  0.0  0.1  82724  3284 ?        S    15:47   0:00 nginx: master process /usr/sbin/nginx
root     23528  0.0  0.0  29300  1232 ?        S    15:47   0:00 cron -f
root     23529  0.0  0.0   4244   404 ?        Ss   15:47   0:00 runsv rsyslog
systemd+ 23530  0.0  0.7 295044 15104 ?        S    15:47   0:00 /usr/lib/postgresql/9.5/bin/postmaster -D /etc/postgresql/9.5/main
systemd+ 23531  0.0  0.0 182664  1240 ?        Sl   15:47   0:00 rsyslogd -n
_apt     23533  2.1 10.9 261612 224668 ?       Sl   15:47   0:06 /usr/bin/redis-server *:6379
www-data 23543  0.1  0.2  84500  5228 ?        S    15:47   0:00 nginx: worker process
www-data 23544  0.2  0.2  84444  5064 ?        S    15:47   0:00 nginx: worker process
www-data 23545  0.0  0.0  82892  1640 ?        S    15:47   0:00 nginx: cache manager process
systemd+ 23549  0.0  0.2 295148  4600 ?        Ss   15:47   0:00 postgres: 9.5/main: checkpointer process
systemd+ 23550  0.0  0.6 295044 13636 ?        Ss   15:47   0:00 postgres: 9.5/main: writer process
systemd+ 23551  0.0  0.3 295044  6432 ?        Ss   15:47   0:00 postgres: 9.5/main: wal writer process
systemd+ 23552  0.0  0.1 295472  3620 ?        Ss   15:47   0:00 postgres: 9.5/main: autovacuum launcher process
systemd+ 23553  0.0  0.1 150348  2368 ?        Ss   15:47   0:00 postgres: 9.5/main: stats collector process
tmpname  23684  0.2  0.1  29900  4092 ?        S    15:47   0:00 /bin/bash config/unicorn_launcher -E production -c config/unicorn.conf.rb
tmpname  23689  6.3  7.9 472664 163784 ?       Sl   15:47   0:17 unicorn master -E production -c config/unicorn.conf.rb
tmpname  23780  0.8  8.5 516740 174604 ?       Sl   15:48   0:02 sidekiq 5.0.5 discourse [0 of 5 busy]
systemd+ 23787 11.8  7.3 315760 149940 ?       Ss   15:48   0:29 postgres: 9.5/main: discourse discourse [local] UPDATE
systemd+ 24034  0.2  2.6 309200 54172 ?        Ss   15:48   0:00 postgres: 9.5/main: discourse discourse [local] SELECT
systemd+ 24068  0.8  7.2 313480 149440 ?       Ss   15:48   0:02 postgres: 9.5/main: discourse discourse [local] idle
systemd+ 24295  0.4  7.2 309404 149224 ?       Ss   15:49   0:00 postgres: 9.5/main: discourse discourse [local] SELECT
systemd+ 24402  0.0  2.3 308764 48828 ?        Ss   15:49   0:00 postgres: 9.5/main: discourse discourse [local] SELECT
systemd+ 24693  0.6  6.7 307392 139276 ?       Ss   15:50   0:00 postgres: 9.5/main: discourse discourse [local] SELECT
systemd+ 24710  1.2  7.8 325556 160096 ?       Ss   15:50   0:01 postgres: 9.5/main: discourse discourse [local] SELECT
systemd+ 24904  0.0  1.0 302932 20832 ?        Ss   15:51   0:00 postgres: 9.5/main: discourse discourse [local] UPDATE waiting
systemd+ 25045  0.0  1.1 303176 23092 ?        Ss   15:51   0:00 postgres: 9.5/main: discourse discourse [local] UPDATE waiting
systemd+ 25210  0.7  3.6 305964 74032 ?        Ss   15:51   0:00 postgres: 9.5/main: discourse discourse [local] SELECT
systemd+ 25213  0.0  0.9 303100 19800 ?        Ss   15:51   0:00 postgres: 9.5/main: discourse discourse [local] idle
tmpname  25218 22.5 14.6 1688000 300012 ?      Sl   15:51   0:07 unicorn worker[0] -E production -c config/unicorn.conf.rb
systemd+ 25345  0.4  2.0 306012 41032 ?        Ss   15:51   0:00 postgres: 9.5/main: discourse discourse [local] UPDATE waiting
tmpname  25378  101 14.4 1679772 295916 ?      Sl   15:52   0:06 unicorn worker[1] -E production -c config/unicorn.conf.rb
root     25394  0.0  0.0      0     0 ?        S    15:52   0:00 [kworker/u4:2]
tmpname  25494  0.0  0.1  20148  3480 ?        S    15:52   0:00 sleep 1
root     25495  0.0  0.1  37364  3216 pts/0    R+   15:52   0:00 ps -aux

I’ve tried rebuilding with a couple of different configurations. One of the first things I tried was simply disabling all plugins in containers/app.yml. I tried this with the latest Discourse version, and also in combination with pinning an older Discourse version in the same file. Unfortunately, while both seemed to prolong the life of the web interface to somewhere between half an hour and an hour, neither solved the problem. I also noticed that with one Discourse version from October the interface would stay up but topics were inaccessible, so I’m not sure how reliable these observations are. Interface lifetime with the default configuration is somewhat variable anyway, but tends to be one to five minutes.
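For reference, each attempt was roughly along these lines; the version tag below is purely illustrative (not necessarily the exact ref I used), and the 'version' param assumes app.yml follows the standard template.

# 1. Edit /var/discourse/containers/app.yml:
#      - comment out the plugin 'git clone' lines under hooks -> after_code, and/or
#      - pin an older ref via the 'version' param, e.g.  version: v1.9.0.beta17   (illustrative)
# 2. Rebuild the container so the change takes effect:
cd /var/discourse
./launcher rebuild app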

The Discourse admin logs in the web interface don’t show any fatal errors during the time the site is up, but I’d love to know how to check from within the container itself if that would help.

Does anyone have any idea what may be happening or what else I could throw at the wall to try and solve it? If more info would be useful I’d be happy to provide it.

Initial questions that come to mind:

  1. What does “unresponsive” mean? There’s a lot of detail hiding behind that one word, much of it diagnostically relevant. So far, you’re telling the doctor “I don’t feel well”…
  2. What’s your exact setup? Is this a machine configured only to run Discourse, in line with our official installation instructions, or some other arrangement?
  3. What do the nginx and unicorn logs say? They’re all in /var/discourse/shared/app/log (var-log/nginx and rails, respectively); see the snippet below for a quick way to follow them.
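For example (paths as above; exact filenames may differ slightly on your install):

# Follow both logs on the host while you reproduce the hang
tail -f /var/discourse/shared/app/log/rails/production.log \
        /var/discourse/shared/app/log/var-log/nginx/error.log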

My bad. When trying to load the site, you get the nginx “504 Gateway Time-out” error page. When navigating within an already-loaded page after it fails, you get the following message.

[screenshot of the in-page error message omitted]

The machine is dedicated to running Discourse. It was installed in accordance with the official instructions. It runs on DigitalOcean NYC2 with the 2GB memory/2 CPU spec. The only major configuration changes that come to mind aside from those enabling HTTPS (done a long time ago) are the plugin specifications in the app.yml file. Here are the ones we’ve been using.

## Plugins go here
## see https://meta.discourse.org/t/19157 for details
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - mkdir -p plugins
          - git clone https://github.com/discourse/docker_manager.git
          - git clone https://github.com/discourse/discourse-spoiler-alert.git
          - git clone https://github.com/discourse/discourse-bbcode-color.git
          - git clone https://github.com/kasperpeulen/discourse-mathjax.git
          - git clone https://github.com/discourse/discourse-data-explorer.git
          - git clone https://github.com/discourse/discourse-staff-notes.git
          - git clone https://github.com/angusmcleod/discourse-quick-messages.git
          - git clone https://github.com/davidtaylorhq/discourse-whos-online.git
          - git clone https://github.com/discourse/discourse-cakeday.git

As mentioned, I’ve tried rebuilding the app with all of these lines removed, and also with only Docker Manager and Spoiler Alert removed, to no effect.

The following seems to be a repeated pattern in the unicorn logs (production.log only; production_errors.log is empty). If you want the full log I’d prefer to PM it since it has some user info.

Redis::CannotConnectError (Error connecting to Redis on localhost:6379 (Redis::TimeoutError))
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:345:in `rescue in establish_connection'
Started GET "/" for 68.180.103.128 at 2017-12-22 19:29:52 +0000
Failed to warm up pretty text: Connection timed out
Job exception: Connection timed out

Job exception: Connection timed out

Failed to process job: Connection timed out ["/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/connection/hiredis.rb:58:in `rescue in read'", "/var/www
/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/connection/hiredis.rb:54:in `read'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/c
lient.rb:262:in `block in read'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:250:in `io'", "/var/www/discourse/vendor/bundle/ruby/2.4.
0/gems/redis-3.3.5/lib/redis/client.rb:261:in `read'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:120:in `block in call'", "/var/www/d
iscourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:231:in `block (2 levels) in process'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/l
ib/redis/client.rb:367:in `ensure_connected'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:221:in `block in process'", "/var/www/discou
rse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:306:in `logging'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:220:in
 `process'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:120:in `call'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/
lib/redis.rb:2399:in `block in _eval'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis.rb:58:in `block in synchronize'", "/usr/local/lib/ruby/2.4.0/
monitor.rb:214:in `mon_synchronize'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis.rb:58:in `synchronize'", "/var/www/discourse/vendor/bundle/ruby
/2.4.0/gems/redis-3.3.5/lib/redis.rb:2398:in `_eval'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis.rb:2450:in `evalsha'", "/var/www/discourse/ven
dor/bundle/ruby/2.4.0/gems/message_bus-2.1.1/lib/message_bus/backends/redis.rb:380:in `cached_eval'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/message_bus-2.1.1/lib
/message_bus/backends/redis.rb:140:in `publish'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/message_bus-2.1.1/lib/message_bus.rb:248:in `publish'", "/var/www/discour
se/vendor/bundle/ruby/2.4.0/gems/message_bus-2.1.1/lib/message_bus.rb:485:in `block in new_subscriber_thread'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/message_bus
-2.1.1/lib/message_bus/timer_thread.rb:102:in `do_work'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/message_bus-2.1.1/lib/message_bus/timer_thread.rb:80:in `block (2
 levels) in queue'"]
Connection timed out subscribe failed, reconnecting in 1 second. Call stack ["/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/connection/hiredis.rb:58
:in `rescue in read'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/connection/hiredis.rb:54:in `read'", "/var/www/discourse/vendor/bundle/ruby/2.
4.0/gems/redis-3.3.5/lib/redis/client.rb:262:in `block in read'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:250:in `io'", "/var/www/d
iscourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:261:in `read'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:136:
in `block (3 levels) in call_loop'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:135:in `loop'", "/var/www/discourse/vendor/bundle/ruby
/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:135:in `block (2 levels) in call_loop'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:231:in
 `block (2 levels) in process'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:367:in `ensure_connected'", "/var/www/discourse/vendor/bun
dle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:221:in `block in process'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:306:in `log
ging'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:220:in `process'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/li
b/redis/client.rb:134:in `block in call_loop'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:280:in `with_socket_timeout'", "/var/www/di
scourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:133:in `call_loop'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/subscribe.
rb:43:in `subscription'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/subscribe.rb:12:in `subscribe'", "/var/www/discourse/vendor/bundle/ruby/2.4
.0/gems/redis-3.3.5/lib/redis.rb:2775:in `_subscription'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis.rb:2143:in `block in subscribe'", "/var/ww
w/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis.rb:58:in `block in synchronize'", "/usr/local/lib/ruby/2.4.0/monitor.rb:214:in `mon_synchronize'", "/var/www/
discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis.rb:58:in `synchronize'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis.rb:2142:in `su
bscribe'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/message_bus-2.1.1/lib/message_bus/backends/redis.rb:336:in `global_subscribe'", "/var/www/discourse/vendor/bundl
e/ruby/2.4.0/gems/message_bus-2.1.1/lib/message_bus.rb:525:in `global_subscribe_thread'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/message_bus-2.1.1/lib/message_bus
.rb:473:in `block in new_subscriber_thread'"]
Started GET "/service-worker.js" for 68.180.103.128 at 2017-12-22 19:30:12 +0000
Processing by StaticController#service_worker_asset as JS
Job exception: Error connecting to Redis on localhost:6379 (Redis::TimeoutError)

Redis::CannotConnectError (Error connecting to Redis on localhost:6379 (Redis::TimeoutError))
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:345:in `rescue in establish_connection'
Completed 500 Internal Server Error in 14015ms
Redis::TimeoutError (Connection timed out)
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/connection/hiredis.rb:58:in `rescue in read'
Job exception: Connection timed out

As for the nginx logs, here is the first part of error.log. The rest of it is just repetition of this as far as I can tell. I’ve redacted client IPs and our domain name.

2017/12/22 07:42:19 [notice] 261#261: signal process started
2017/12/22 07:42:20 [error] 286#286: *1 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /message-bu
s/48e4a9b37eca4c8698b4fc139e65b5b3/poll HTTP/2.0", upstream: "http://127.0.0.1:3000/message-bus/48e4a9b37eca4c8698b4fc139e65b5b3/poll", host: "fake.site", referrer: "https
://fake.site/"
2017/12/22 07:42:20 [error] 285#285: *4 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /message-b
us/c0f4b4033bc54e689e81eb3c63888cad/poll HTTP/1.1", upstream: "http://127.0.0.1:3000/message-bus/c0f4b4033bc54e689e81eb3c63888cad/poll", host: "fake.site", referrer: "http
s://fake.site/"
2017/12/22 07:42:26 [error] 285#285: *7 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /logs/repor
t_js_error HTTP/2.0", upstream: "http://127.0.0.1:3000/logs/report_js_error", host: "fake.site", referrer: "https://fake.site/admin"
2017/12/22 07:42:26 [error] 285#285: *7 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /message-bu
s/8353172cce0b429b9add2a2ec91a48b7/poll?dlp=t HTTP/2.0", upstream: "http://127.0.0.1:3000/message-bus/8353172cce0b429b9add2a2ec91a48b7/poll?dlp=t", host: "fake.site", refe
rrer: "https://fake.site/admin/upgrade"
2017/12/22 07:42:26 [error] 285#285: *7 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /message-bu
s/6c6e9448d24b433d876b332dfdf096db/poll HTTP/2.0", upstream: "http://127.0.0.1:3000/message-bus/6c6e9448d24b433d876b332dfdf096db/poll", host: "fake.site", referrer: "https
://fake.site/admin"
2017/12/22 07:42:26 [error] 285#285: *7 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /message-bu
s/6c6e9448d24b433d876b332dfdf096db/poll HTTP/2.0", upstream: "http://127.0.0.1:3000/message-bus/6c6e9448d24b433d876b332dfdf096db/poll", host: "fake.site", referrer: "https
://fake.site/admin"
2017/12/22 07:42:30 [error] 286#286: *12 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /message-bu
s/0192d8e00b5c4e14af162a50f562973a/poll HTTP/2.0", upstream: "http://127.0.0.1:3000/message-bus/0192d8e00b5c4e14af162a50f562973a/poll", host: "fake.site", referrer: "https
://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13831"
2017/12/22 07:42:30 [error] 286#286: *14 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /message-b
us/27d5d4160f034009bf58909902edeb75/poll HTTP/2.0", upstream: "http://127.0.0.1:3000/message-bus/27d5d4160f034009bf58909902edeb75/poll", host: "fake.site", referrer: "http
s://fake.site/"
2017/12/22 07:42:31 [error] 286#286: *16 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /message-bu
s/8fc1f427309446a584eff0c0b8db0cb4/poll HTTP/2.0", upstream: "http://127.0.0.1:3000/message-bus/8fc1f427309446a584eff0c0b8db0cb4/poll", host: "fake.site", referrer: "https
://fake.site/"
2017/12/22 07:42:31 [error] 285#285: *18 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /message-bu
s/982c031dc1f3453baac29d2bcaa85c9b/poll HTTP/2.0", upstream: "http://127.0.0.1:3000/message-bus/982c031dc1f3453baac29d2bcaa85c9b/poll", host: "fake.site", referrer: "https
://fake.site/"
2017/12/22 07:42:31 [error] 285#285: *20 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /message-b
us/9e0d3cb0cc1249b39a6e2c8d07489ff2/poll HTTP/2.0", upstream: "http://127.0.0.1:3000/message-bus/9e0d3cb0cc1249b39a6e2c8d07489ff2/poll", host: "fake.site", referrer: "http
s://fake.site/t/dream-thread/201/438"
2017/12/22 07:42:31 [error] 286#286: *21 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /message-b
us/f053c65c9e3e4c16ae8e464b97f1c406/poll HTTP/2.0", upstream: "http://127.0.0.1:3000/message-bus/f053c65c9e3e4c16ae8e464b97f1c406/poll", host: "fake.site", referrer: "http
s://fake.site/"
2017/12/22 07:42:31 [error] 285#285: *24 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /message-bu
s/2889e7e348a9493c85b98087035c8f43/poll?dlp=t HTTP/2.0", upstream: "http://127.0.0.1:3000/message-bus/2889e7e348a9493c85b98087035c8f43/poll?dlp=t", host: "fake.site", refe
rrer: "https://fake.site/"
2017/12/22 07:42:32 [error] 285#285: *4 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /message-b
us/c0f4b4033bc54e689e81eb3c63888cad/poll HTTP/1.1", upstream: "http://127.0.0.1:3000/message-bus/c0f4b4033bc54e689e81eb3c63888cad/poll", host: "fake.site", referrer: "http
s://fake.site/"
2017/12/22 07:42:32 [error] 285#285: *27 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /message-b
us/1bf6fc182a8a484a9146dd895960a4f9/poll?dlp=t HTTP/2.0", upstream: "http://127.0.0.1:3000/message-bus/1bf6fc182a8a484a9146dd895960a4f9/poll?dlp=t", host: "fake.site", ref
errer: "https://fake.site/t/nje/3023"
2017/12/22 07:42:33 [error] 286#286: *29 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /message-
bus/eb40ccef79474e01847f663ed97c5a67/poll?dlp=t HTTP/2.0", upstream: "http://127.0.0.1:3000/message-bus/eb40ccef79474e01847f663ed97c5a67/poll?dlp=t", host: "fake.site", re
ferrer: "https://fake.site/"
2017/12/22 07:42:33 [error] 286#286: *29 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /message-
bus/d24384e025bd44f287cfee96e318cd4d/poll?dlp=t HTTP/2.0", upstream: "http://127.0.0.1:3000/message-bus/d24384e025bd44f287cfee96e318cd4d/poll?dlp=t", host: "fake.site", re
ferrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13826"
2017/12/22 07:42:33 [error] 285#285: *32 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /message-b
us/d0506689e0f3431184850f16d2ef8612/poll?dlp=t HTTP/2.0", upstream: "http://127.0.0.1:3000/message-bus/d0506689e0f3431184850f16d2ef8612/poll?dlp=t", host: "fake.site", ref
errer: "https://fake.site/t/who-are-you/146/33866"
2017/12/22 07:42:33 [error] 286#286: *33 connect() failed (111: Connection refused) while connecting to upstream, client: [redacted], server: _, request: "POST /message-b
us/48e4a9b37eca4c8698b4fc139e65b5b3/poll HTTP/2.0", upstream: "http://127.0.0.1:3000/message-bus/48e4a9b37eca4c8698b4fc139e65b5b3/poll", host: "fake.site", referrer: "http
s://fake.site/"
2017/12/22 07:46:07 [error] 285#285: *18 upstream prematurely closed connection while reading response header from upstream, client: [redacted], server: _, request: "GET /
t/hot-drink-thread/3577/93 HTTP/2.0", upstream: "http://127.0.0.1:3000/t/hot-drink-thread/3577/93", host: "fake.site", referrer: "https://fake.site/"
2017/12/22 07:46:53 [error] 285#285: *159 upstream prematurely closed connection while reading response header from upstream, client: [redacted], server: _, request: "GE
T /t/off-derpic-xi-the-undiscovered-derpic/3422/13831 HTTP/2.0", upstream: "http://127.0.0.1:3000/t/off-derpic-xi-the-undiscovered-derpic/3422/13831", host: "fake.site"
2017/12/22 07:47:40 [error] 285#285: *32 upstream prematurely closed connection while reading response header from upstream, client: [redacted], server: _, request: "POST
 /posts HTTP/2.0", upstream: "http://127.0.0.1:3000/posts", host: "fake.site", referrer: "https://fake.site/t/dream-thread/201/445"
2017/12/22 07:47:49 [error] 286#286: *12 upstream timed out (110: Connection timed out) while reading response header from upstream, client: [redacted], server: _, request
: "GET /latest.json?_=1513928645399 HTTP/2.0", upstream: "http://127.0.0.1:3000/latest.json?_=1513928645399", host: "fake.site", referrer: "https://fake.site/"
2017/12/22 07:47:50 [error] 285#285: *32 upstream timed out (110: Connection timed out) while reading response header from upstream, client: [redacted], server: _, reques
t: "POST /topics/timings HTTP/2.0", upstream: "http://127.0.0.1:3000/topics/timings", host: "fake.site", referrer: "https://fake.site/t/dream-thread/201/445"
2017/12/22 07:47:56 [error] 286#286: *21 upstream timed out (110: Connection timed out) while reading response header from upstream, client: [redacted], server: _, reques
t: "POST /message-bus/f053c65c9e3e4c16ae8e464b97f1c406/poll?dlp=t HTTP/2.0", upstream: "http://127.0.0.1:3000/message-bus/f053c65c9e3e4c16ae8e464b97f1c406/poll?dlp=t", host:
 "fake.site", referrer: "https://fake.site/"

An interesting portion from error.log.1.

2017/12/21 16:14:44 [error] 28678#28678: *7302014 client intended to send too large body: 10556646 bytes, client: [redacted], server: _, request: "POST /uploads.json?clie
nt_id=[redacted]&authenticity_token=[redacted] HTTP/2.0", host: "
fake.site", referrer: "https://fake.site/t/things-you-find-a-lot-funnier-than-you-probably-should/1502/4702"
2017/12/21 18:39:54 [error] 28679#28679: *7319794 limiting requests, excess: 12.312 by zone "flood", client: [redacted], server: _, request: "GET /posts/930084?_=1513844
206451 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13557"
2017/12/21 18:39:54 [error] 28679#28679: *7319794 limiting requests, excess: 12.312 by zone "flood", client: [redacted], server: _, request: "GET /posts/930087?_=1513844
206452 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13557"
2017/12/21 18:39:54 [error] 28679#28679: *7319794 limiting requests, excess: 12.312 by zone "flood", client: [redacted], server: _, request: "GET /posts/930091?_=1513844
206453 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13557"
2017/12/21 18:39:54 [error] 28679#28679: *7319794 limiting requests, excess: 12.312 by zone "flood", client: [redacted], server: _, request: "GET /posts/930095?_=1513844
206454 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13557"
2017/12/21 18:39:54 [error] 28679#28679: *7319794 limiting requests, excess: 12.312 by zone "flood", client: [redacted], server: _, request: "GET /posts/930112?_=1513844
206455 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13557"
2017/12/21 18:39:54 [error] 28679#28679: *7319794 limiting requests, excess: 12.312 by zone "flood", client: [redacted], server: _, request: "GET /posts/930117?_=1513844
206456 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13557"
2017/12/21 18:39:54 [error] 28679#28679: *7319794 limiting requests, excess: 12.312 by zone "flood", client: [redacted], server: _, request: "GET /posts/930118?_=1513844
206457 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13557"
2017/12/21 18:39:54 [error] 28679#28679: *7319794 limiting requests, excess: 12.312 by zone "flood", client: [redacted], server: _, request: "GET /posts/930125?_=1513844
206458 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13557"
2017/12/21 18:39:54 [error] 28679#28679: *7319794 limiting requests, excess: 12.312 by zone "flood", client: [redacted], server: _, request: "GET /posts/930128?_=1513844
206459 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13557"
2017/12/21 18:39:54 [error] 28679#28679: *7319794 limiting requests, excess: 12.312 by zone "flood", client: [redacted], server: _, request: "GET /posts/930129?_=1513844
206460 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13557"
2017/12/21 18:39:54 [error] 28679#28679: *7319794 limiting requests, excess: 12.312 by zone "flood", client: [redacted], server: _, request: "GET /posts/930131?_=1513844
206461 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13557"
2017/12/21 18:39:54 [error] 28679#28679: *7319794 limiting requests, excess: 12.312 by zone "flood", client: [redacted], server: _, request: "GET /posts/930141?_=1513844
206462 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13557"
2017/12/21 18:39:54 [error] 28679#28679: *7319794 limiting requests, excess: 12.312 by zone "flood", client: [redacted], server: _, request: "GET /posts/930107?_=1513844
206463 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13557"
2017/12/21 18:39:54 [error] 28679#28679: *7319794 limiting requests, excess: 12.312 by zone "flood", client: [redacted], server: _, request: "GET /posts/930114?_=1513844
206464 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13557"
2017/12/21 18:39:54 [error] 28679#28679: *7319794 limiting requests, excess: 12.312 by zone "flood", client: [redacted], server: _, request: "GET /posts/930125?_=1513844
206465 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13557"
2017/12/21 18:39:54 [error] 28679#28679: *7319794 limiting requests, excess: 12.312 by zone "flood", client: [redacted], server: _, request: "GET /posts/930129?_=1513844
206466 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13557"
2017/12/21 18:39:54 [error] 28679#28679: *7319794 limiting requests, excess: 12.312 by zone "flood", client: [redacted], server: _, request: "GET /posts/930131?_=1513844
206467 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13557"
2017/12/21 18:39:54 [error] 28679#28679: *7319794 limiting requests, excess: 12.312 by zone "flood", client: [redacted], server: _, request: "GET /posts/930135?_=1513844206468 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/off-derpic-xi-the-undiscovered-derpic/3422/13557"
2017/12/22 02:41:32 [error] 28678#28678: *7391603 limiting requests, excess: 12.676 by zone "flood", client: [redacted], server: _, request: "GET /posts/930314?_=1513882146189 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/things-you-find-a-lot-funnier-than-you-probably-should/1502/4726"
2017/12/22 02:41:32 [error] 28678#28678: *7391603 limiting requests, excess: 12.676 by zone "flood", client: [redacted], server: _, request: "GET /posts/930440?_=1513882146190 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/things-you-find-a-lot-funnier-than-you-probably-should/1502/4726"
2017/12/22 02:41:32 [error] 28678#28678: *7391603 limiting requests, excess: 12.676 by zone "flood", client: [redacted], server: _, request: "GET /posts/930433?_=1513882146191 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/things-you-find-a-lot-funnier-than-you-probably-should/1502/4726"
2017/12/22 02:41:32 [error] 28678#28678: *7391603 limiting requests, excess: 12.676 by zone "flood", client: [redacted], server: _, request: "GET /posts/930433?_=1513882146192 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/things-you-find-a-lot-funnier-than-you-probably-should/1502/4726"
2017/12/22 02:41:32 [error] 28678#28678: *7391603 limiting requests, excess: 12.676 by zone "flood", client: [redacted], server: _, request: "GET /posts/930558?_=1513882146193 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/things-you-find-a-lot-funnier-than-you-probably-should/1502/4726"
2017/12/22 02:41:32 [error] 28678#28678: *7391603 limiting requests, excess: 12.676 by zone "flood", client: [redacted], server: _, request: "GET /posts/930140?_=1513882146194 HTTP/2.0", host: "fake.site", referrer: "https://fake.site/t/things-you-find-a-lot-funnier-than-you-probably-should/1502/4726"

Judging by the tail ends of the nginx logs (not posted above), requests are still getting logged. Nobody’s getting any responses though.
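(For reference, a quick way to eyeball that, using the log location from the earlier reply:)

# Requests keep arriving during the hang, but they all eventually come back as 504s
tail -f /var/discourse/shared/app/log/var-log/nginx/access.log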

That’s the smoking gun – connecting to Redis is timing out. We’ve seen this a bit lately, but have never been able to pin down why it’s happening. Based on the top output, you’re not short of memory, so that’s one potential culprit out of the way. Do the container logs (./launcher logs app) show anything Redis-related?
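Something like this should be enough to pull out any Redis-related lines:

cd /var/discourse
./launcher logs app | grep -i redis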

279:M 22 Dec 20:01:09.036 # Server started, Redis version 3.0.6
279:M 22 Dec 20:01:09.036 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
279:M 22 Dec 20:01:09.038 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.

That’s the only thing of note mentioning Redis. Should I give the THP configuration change a go?

Nah, THP isn’t the problem here.

Was that ps output taken while the problem was occurring? I see redis is running there, and not really doing anything.

At this point, all my remaining diagnostic suggestions are fairly involved. How are you with strace, lsof, and tcpdump? They’re the tools you’ll need to see what’s going on. Alternatively, if you’re open to letting me onto the machine, I can do the investigation myself.
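For the curious, this is roughly the kind of thing I’d reach for (PIDs taken from the ps output above; tcpdump is best run from inside the container, e.g. via ./launcher enter app, so it sees the container’s localhost traffic):

strace -f -p <redis_pid>          # which syscall, if any, is redis blocked in?
lsof -p <redis_pid>               # its open files, sockets and pipes
tcpdump -i lo -nn port 6379       # are connection attempts to redis even arriving?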


I believe the output of ps was taken while the problem was occurring. I meant it more in the sense of “before requests stop being answered” and “after”. Just in case, here’s another ps -aux. The container has been running for about an hour and is not responding to requests, so what you see here reflects the system state while the problem is occurring.

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  37912  3172 ?        Ss   08:11   0:05 /sbin/init
root         2  0.0  0.0      0     0 ?        S    08:11   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    08:11   0:02 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S<   08:11   0:00 [kworker/0:0H]
root         7  0.0  0.0      0     0 ?        S    08:11   0:19 [rcu_sched]
root         8  0.0  0.0      0     0 ?        S    08:11   0:00 [rcu_bh]
root         9  0.0  0.0      0     0 ?        S    08:11   0:00 [migration/0]
root        10  0.0  0.0      0     0 ?        S    08:11   0:00 [watchdog/0]
root        11  0.0  0.0      0     0 ?        S    08:11   0:00 [watchdog/1]
root        12  0.0  0.0      0     0 ?        S    08:11   0:00 [migration/1]
root        13  0.0  0.0      0     0 ?        S    08:11   0:02 [ksoftirqd/1]
root        15  0.0  0.0      0     0 ?        S<   08:11   0:00 [kworker/1:0H]
root        16  0.0  0.0      0     0 ?        S    08:11   0:00 [kdevtmpfs]
root        17  0.0  0.0      0     0 ?        S<   08:11   0:00 [netns]
root        18  0.0  0.0      0     0 ?        S<   08:11   0:00 [perf]
root        19  0.0  0.0      0     0 ?        S    08:11   0:00 [khungtaskd]
root        20  0.0  0.0      0     0 ?        S<   08:11   0:00 [writeback]
root        21  0.0  0.0      0     0 ?        SN   08:11   0:00 [ksmd]
root        22  0.0  0.0      0     0 ?        SN   08:11   0:04 [khugepaged]
root        23  0.0  0.0      0     0 ?        S<   08:11   0:00 [crypto]
root        24  0.0  0.0      0     0 ?        S<   08:11   0:00 [kintegrityd]
root        25  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        26  0.0  0.0      0     0 ?        S<   08:11   0:00 [kblockd]
root        27  0.0  0.0      0     0 ?        S<   08:11   0:00 [ata_sff]
root        28  0.0  0.0      0     0 ?        S<   08:11   0:00 [md]
root        29  0.0  0.0      0     0 ?        S<   08:11   0:00 [devfreq_wq]
root        33  0.1  0.0      0     0 ?        S    08:11   0:43 [kswapd0]
root        34  0.0  0.0      0     0 ?        S<   08:11   0:00 [vmstat]
root        35  0.0  0.0      0     0 ?        S    08:11   0:00 [fsnotify_mark]
root        36  0.0  0.0      0     0 ?        S    08:11   0:00 [ecryptfs-kthrea]
root        52  0.0  0.0      0     0 ?        S<   08:11   0:00 [kthrotld]
root        53  0.0  0.0      0     0 ?        S<   08:11   0:00 [acpi_thermal_pm]
root        54  0.0  0.0      0     0 ?        S    08:11   0:00 [vballoon]
root        55  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        56  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        57  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        58  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        59  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        60  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        61  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        62  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        63  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        64  0.0  0.0      0     0 ?        S    08:11   0:00 [scsi_eh_0]
root        65  0.0  0.0      0     0 ?        S<   08:11   0:00 [scsi_tmf_0]
root        66  0.0  0.0      0     0 ?        S    08:11   0:00 [scsi_eh_1]
root        67  0.0  0.0      0     0 ?        S<   08:11   0:00 [scsi_tmf_1]
root        73  0.0  0.0      0     0 ?        S<   08:11   0:00 [ipv6_addrconf]
root        86  0.0  0.0      0     0 ?        S<   08:11   0:00 [deferwq]
root        87  0.0  0.0      0     0 ?        S<   08:11   0:00 [charger_manager]
root       124  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       125  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       126  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       127  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       128  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       129  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       130  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       131  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       132  0.0  0.0      0     0 ?        S    08:11   0:00 [scsi_eh_2]
root       133  0.0  0.0      0     0 ?        S<   08:11   0:00 [kpsmoused]
root       134  0.0  0.0      0     0 ?        S<   08:11   0:00 [scsi_tmf_2]
root       135  0.0  0.0      0     0 ?        S<   08:11   0:00 [ttm_swap]
root       136  0.0  0.0      0     0 ?        S<   08:11   0:00 [qxl_gc]
root       413  0.0  0.0      0     0 ?        S    08:11   0:02 [jbd2/vda1-8]
root       414  0.0  0.0      0     0 ?        S<   08:11   0:00 [ext4-rsv-conver]
root       455  0.0  0.0      0     0 ?        S<   08:11   0:00 [kworker/0:1H]
root       456  0.0  0.0  29648  1824 ?        Ss   08:11   0:07 /lib/systemd/systemd-journald
root       479  0.0  0.0      0     0 ?        S    08:11   0:00 [kauditd]
root       495  0.0  0.0  44432  1468 ?        Ss   08:11   0:00 /lib/systemd/systemd-udevd
root       532  0.0  0.0      0     0 ?        S<   08:11   0:00 [kworker/1:1H]
systemd+   627  0.0  0.0 100324   140 ?        Ssl  08:11   0:00 /lib/systemd/systemd-timesyncd
root       750  0.0  0.0      0     0 ?        S<   08:11   0:00 [kvm-irqfd-clean]
root       991  0.0  0.0   4400    80 ?        Ss   08:12   0:00 /usr/sbin/acpid
root       997  0.0  0.0  65520  1032 ?        Ss   08:12   0:02 /usr/sbin/sshd -D
root      1004  0.0  0.0  29880   220 ?        Ss   08:12   0:00 /sbin/cgmanager -m name=systemd
syslog    1008  0.0  0.0 256396   664 ?        Ssl  08:12   0:02 /usr/sbin/rsyslogd -n
root      1012  0.0  0.0 275872   960 ?        Ssl  08:12   0:03 /usr/lib/accountsservice/accounts-daemon
root      1015  0.0  0.0  29008   512 ?        Ss   08:12   0:00 /usr/sbin/cron -f
message+  1020  0.0  0.0  42900   476 ?        Ss   08:12   0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
daemon    1030  0.0  0.0  26044   196 ?        Ss   08:12   0:00 /usr/sbin/atd -f
root      1031  0.0  0.0  28620   628 ?        Ss   08:12   0:00 /lib/systemd/systemd-logind
root      1038  0.4  1.7 530876 36640 ?        Ssl  08:12   2:51 /usr/bin/dockerd -H fd://
root      1068  0.0  0.0  19472   104 ?        Ss   08:12   0:01 /usr/sbin/irqbalance --pid=/var/run/irqbalance.pid
root      1075  0.0  0.0 277088   696 ?        Ssl  08:12   0:00 /usr/lib/policykit-1/polkitd --no-debug
root      1091  0.0  0.0  15940   144 tty1     Ss+  08:12   0:00 /sbin/agetty --noclear tty1 linux
root      1095  0.1  0.7 383184 15316 ?        Ssl  08:12   1:13 docker-containerd --config /var/run/docker/containerd/containerd.toml
root      3762  0.0  0.0      0     0 ?        S    17:44   0:00 [kworker/1:1]
root      4026  0.0  0.0 116408   108 ?        Sl   17:44   0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 443 -container-ip 172.17.0.2 -container-port 44
root      4038  0.0  0.0 118072   424 ?        Sl   17:44   0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 80 -container-ip 172.17.0.2 -container-port 80
root      4044  0.0  0.0 363444  1940 ?        Sl   17:44   0:00 docker-containerd-shim --namespace moby --workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1
root      4060  0.0  0.0  21288     0 pts/0    Ss+  17:44   0:00 /bin/bash /sbin/boot
root      4357  0.0  0.0   4396    28 pts/0    S+   17:44   0:00 /usr/bin/runsvdir -P /etc/service
root      4358  0.0  0.0   4244     4 ?        Ss   17:44   0:00 runsv cron
root      4359  0.0  0.0   4244     0 ?        Ss   17:44   0:00 runsv nginx
root      4360  0.0  0.0   4244     4 ?        Ss   17:44   0:00 runsv postgres
root      4361  0.0  0.0  82724   888 ?        S    17:44   0:00 nginx: master process /usr/sbin/nginx
root      4362  0.0  0.0  29300  1108 ?        S    17:44   0:00 cron -f
root      4363  0.0  0.0   4244     0 ?        Ss   17:44   0:00 runsv unicorn
root      4364  0.0  0.0   4244    16 ?        Ss   17:44   0:00 runsv redis
root      4365  0.0  0.0   4244     0 ?        Ss   17:44   0:00 runsv rsyslog
systemd+  4366  0.0  0.5 295044 12132 ?        S    17:44   0:00 /usr/lib/postgresql/9.5/bin/postmaster -D /etc/postgresql/9.5/main
tmpname   4367  0.0  0.1  29900  3756 ?        S    17:44   0:02 /bin/bash config/unicorn_launcher -E production -c config/unicorn.conf.rb
systemd+  4368  0.0  0.0 182664   164 ?        Sl   17:44   0:00 rsyslogd -n
_apt      4369  0.1 10.8 261612 222176 ?       Sl   17:44   0:04 /usr/bin/redis-server *:6379
www-data  4380  0.0  0.2  84376  4380 ?        S    17:44   0:02 nginx: worker process
www-data  4381  0.0  0.2  84332  5708 ?        S    17:44   0:01 nginx: worker process
www-data  4382  0.0  0.0  82892   608 ?        S    17:44   0:00 nginx: cache manager process
systemd+  4385  0.0  0.2 295156  4684 ?        Ss   17:44   0:00 postgres: 9.5/main: checkpointer process
systemd+  4386  0.0  3.1 295044 64104 ?        Ss   17:44   0:00 postgres: 9.5/main: writer process
systemd+  4387  0.0  0.2 295044  4920 ?        Ss   17:44   0:00 postgres: 9.5/main: wal writer process
systemd+  4388  0.0  0.1 295472  3120 ?        Ss   17:44   0:00 postgres: 9.5/main: autovacuum launcher process
systemd+  4389  0.0  0.0 150348   820 ?        Ss   17:44   0:00 postgres: 9.5/main: stats collector process
tmpname   4392  0.7  7.0 472664 143456 ?       Sl   17:44   0:25 unicorn master -E production -c config/unicorn.conf.rb
tmpname   4495  1.7 18.8 1830520 385064 ?      Sl   17:44   0:57 sidekiq 5.0.5 discourse [0 of 5 busy]
systemd+  4800  0.4  8.4 342592 173608 ?       Ss   17:45   0:13 postgres: 9.5/main: discourse discourse [local] SELECT
root      4809  0.0  0.0      0     0 ?        S    17:45   0:00 [kworker/u4:1]
systemd+  5208  0.7  7.0 307028 143704 ?       Ss   17:46   0:24 postgres: 9.5/main: discourse discourse [local] SELECT
systemd+  5682  0.0  7.3 313932 149524 ?       Ss   17:47   0:01 postgres: 9.5/main: discourse discourse [local] SELECT
systemd+  5889  0.0  1.8 308660 38316 ?        Ss   17:48   0:00 postgres: 9.5/main: discourse discourse [local] SELECT
root      7792  0.0  0.0      0     0 ?        S    18:12   0:00 [kworker/0:1]
root      7863  0.0  0.0      0     0 ?        S    18:12   0:00 [kworker/1:2]
root      7923  0.0  0.0      0     0 ?        S    18:12   0:00 [kworker/u4:0]
tmpname   9855  1.0 11.3 1628044 232600 ?      Sl   18:33   0:03 unicorn worker[0] -E production -c config/unicorn.conf.rb
tmpname  10343 15.0 11.2 1627016 230736 ?      Sl   18:39   0:02 unicorn worker[1] -E production -c config/unicorn.conf.rb
systemd+ 10362  0.0  0.7 302932 15832 ?        Ss   18:39   0:00 postgres: 9.5/main: discourse discourse [local] idle
tmpname  10389  1.0  0.1  20148  3560 ?        S    18:39   0:00 sleep 1
root     10390  0.0  0.1  37364  3300 pts/0    R+   18:39   0:00 ps -aux
root     20792  0.0  0.0  99512   840 ?        Ss   15:35   0:00 sshd: tmpname [priv]
tmpname  20812  0.0  0.1  45248  2208 ?        Ss   15:36   0:00 /lib/systemd/systemd --user
tmpname  20816  0.0  0.0  63424   152 ?        S    15:36   0:00 (sd-pam)
tmpname  20924  0.0  0.1 100296  2052 ?        S    15:36   0:01 sshd: tmpname@pts/0
tmpname  20925  0.0  0.0  24324   736 pts/0    Ss   15:36   0:00 -bash
root     20957  0.0  0.0  56836   748 pts/0    S    15:36   0:00 sudo bash
root     20958  0.0  0.1  24284  3024 pts/0    S    15:36   0:00 bash
root     26276  0.0  0.0      0     0 ?        S    16:00   0:00 [kworker/0:0]

I know of strace and its purpose but haven’t had much practice with it, and none at all with lsof or tcpdump. I’d be fine letting you poke around on the machine. Let me take a snapshot of it first so that if anything goes awry it won’t be a big deal. :slight_smile: It’s a for-fun community forum with no financial or other major concerns about the Discourse machine being lost, and downtime isn’t a major issue, but just in case I’ll start backing things up now unless you think it’s absolutely unnecessary.

PM me and we can arrange access.

Remove anything here that does not begin with https://github.com/discourse – we also have an official MathJax plugin, as I recall? Yes, it’s at https://github.com/discourse/discourse-math.


The no-plugins dance has already been tried.

Also, timing out whilst connecting to Redis really shouldn’t be possible to trigger via a plugin. That’d be some powerful magicks. Then again, timing out whilst trying to connect to localhost shouldn’t be possible at all…


If the time comes I’ll give it a go, as I’m not sure whether I tried only the official Discourse plugins, but I did try no plugins and a few other arrangements. I probably won’t be touching it once I hand it over to Matt, though.

Thanks for pointing out the MathJax plugin. Will definitely make use of that in the future!

Hoookay… after @Tsirist very kindly let me dig around, I’ve been able to narrow this down to a Docker bug of some sort. I’m having a great run of those lately.

tl;dr: if you are seeing Redis::TimeoutError in your Unicorn logs, try restarting Docker (service docker restart). That appears to make the problem go away, at least for a while.
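
If you want a quick way to check whether you’re hitting this, something along these lines works (a rough sketch – the log path assumes a standard /var/discourse standalone install, so adjust it to match your setup):

    # Count Redis timeout errors in the Unicorn/Rails logs (path is an
    # assumption based on a default install; yours may differ).
    grep -c "Redis::TimeoutError" /var/discourse/shared/standalone/log/rails/unicorn.stderr.log

    # If that number keeps climbing while the site is unresponsive, restart Docker:
    sudo service docker restart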

The rest of this post will go into a lot of detail about what’s going on. You can safely ignore it if you’re not a fan of the deep minutiae of Unix-like systems. Mostly I’m writing all this down because otherwise I’ll forget it.

Under normal operation, Docker containers capture everything that gets written to the stdout/stderr of the processes running in the container, and log it. This is the log you see when you run ./launcher logs app (or, if you interact directly with Docker, docker logs app). This log data is captured by pointing the container’s “default” stdout/stderr to pipes (or a pseudo-terminal, if you start the container with docker run -t), and having something at the other end of the pipe (or pty) read that data and write it to disk. In Docker, that something is a process called docker-containerd-shim.

The problem, it appears, is that this process sometimes stops reading data from the pipe (or pty). This then causes the buffer in the pipe/pty to fill up, and when a pipe’s buffer fills up, attempts to write to the pipe block (or fail, if O_NONBLOCK is set, which it never is, because nobody expects a write to stdout/stderr to block). Because this write blocks, the entire thread of execution seizes up, and that’s pretty much the end of that.
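
You can see the same blocking behaviour in isolation with a toy example that has nothing to do with Docker – just a writer whose reader never drains the pipe:

    # `yes` writes into the pipe, `sleep` never reads from it; once the pipe
    # buffer (typically 64 KiB on Linux) fills, `yes` blocks inside write(2)
    # and its CPU usage drops to zero.
    yes "pretend this is a very chatty log line" | sleep 600 &

    sleep 2

    # The writer is now asleep, waiting in the kernel for the pipe to drain;
    # on many kernels the WCHAN column shows something like pipe_write.
    ps -o pid,stat,wchan:20,cmd -C yes

    # Clean up the demo.
    kill %1

That’s exactly the situation the container’s processes end up in, except that the thing which should be doing the reading is docker-containerd-shim.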

That’s why attempts to connect to Redis eventually time out – Redis tries to write to stdout, that write blocks, and Redis is now completely wedged; nothing’s doing anything any more. So, when new connections are attempted, they pile up in the accept(2) backlog queue, and eventually the connect(2) attempt times out because the connection was never actually accepted.

Incidentally, PgSQL connections are also seizing up (trying to write query logs), but for whatever reason the app’s trying to connect to Redis first, so that’s what causes the error.
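
For what it’s worth, this is straightforward to confirm with strace on a wedged instance: the container’s processes are visible from the host, and a stuck one sits in a single write to stdout that never returns. A rough sketch (the <pid> is just a placeholder for whatever pgrep reports):

    # Find the redis-server process from the host, then attach to it.
    pgrep -a redis-server
    sudo strace -p <pid>

    # A wedged process shows one write(1, "...", ...) call that simply never
    # completes; the same check works for the postgres backends.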

I haven’t been able to figure out exactly why the shim process stops reading. I restarted Docker to test a theory, and the problem stopped happening. My surmise, given that restarting just the container doesn’t fix the problem, is that the core problem is actually in the Docker daemon itself, and that it just causes some sort of brainfart in the shim. If I get an opportunity to examine another misbehaving system, I’ll certainly look into the problem further.

For now, though, the way to clear the problem appears to be to restart Docker. At least that turns “site runs for five minutes, then falls over” into “site runs for at least a couple of hours”, which is an improvement… I guess…

18 Likes

So it’s not gremlins? This is something of a relief.

And you think this explains the bootstrap problem too, and the solution is to just reboot? (Overkill compared to just restarting Docker, but that’d do it.) I’ve been working with computers for nearly 40 years. The first question is always “is it plugged in?” and the second is “did you reboot?” I’m back to feeling foolish. (At least rebooting doesn’t actually fix the problem.)

And I thought I was so clever just restarting the data container.

Have you seen any reports of this bug elsewhere?

You’re welcome to have a look at my system again. It went down a couple of times yesterday. I’m pretty sure it’ll die again if you bootstrap multi.

2 Likes

It’s probably best to keep the discussion of the issues you’ve been seeing over in the other thread, because they’re more varied and I can’t account for all of them with this bug. Also, I don’t know whether rebooting the machine will make the problem better or worse – I’ve only tried restarting the Docker service, which is a very different beast.

4 Likes

Ladies, Gentlemen, and Small Furry Creatures From Alpha Centauri, I present to you: the Docker bug that has been causing all the angst. I got a reliable repro, and verified that only docker-ce 17.11 is impacted. Hopefully Docker will fix the bug sooner or later, but until then the recommended workaround is to downgrade Docker to a working version, as follows (a consolidated command sketch appears after the steps):

  1. Stop and delete all running containers. In addition to this bug, Docker also changed some other things in the way that containers are run, and a downgrade while containers are still running will end in tears. If you’re only running Discourse on your server, you can just stop and delete the container with docker rm -f app (your data is safe, and won’t be deleted by this command). If you’re running other containers on the machine as well, you’ll have to figure out what to do.

  2. Downgrade Docker. Via the magic of the package manager, apt-get install docker-ce=17.10.0~ce-0~ubuntu will do the trick. You’ll have to say y to the installation, because it’s a downgrade.

  3. (optional) Make sure Docker doesn’t get automatically upgraded again. That’d really ruin your day, because not only would you have a buggy Docker behind the wheel again, but due to the aforementioned changes in how containers are run, an automated upgrade would likely leave your containers in a kind of limbo state. Not cool.

    To make sure you stay on a known-good version, create a file /etc/apt/preferences.d/docker.pref with the following contents:

     Package: docker-ce
     Pin: version 17.10*
     Pin-Priority: 9001
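
Putting the steps together, the whole dance looks roughly like this on a standard single-container install in /var/discourse (treat it as a sketch and adapt paths and sudo usage to your setup):

    # 1. Stop and delete the Discourse container; your data lives in shared
    #    volumes and is not removed by this.
    cd /var/discourse
    sudo docker rm -f app

    # 2. Downgrade docker-ce to the last known-good version.
    sudo apt-get install docker-ce=17.10.0~ce-0~ubuntu

    # 3. (optional) Pin the package so an automatic upgrade doesn't pull
    #    17.11 back in.
    printf 'Package: docker-ce\nPin: version 17.10*\nPin-Priority: 9001\n' | sudo tee /etc/apt/preferences.d/docker.pref

    # Rebuild and start the container again.
    sudo ./launcher rebuild app

(apt-mark hold docker-ce would achieve much the same thing as the pin file, if you prefer that.)
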
23 Likes

I can confirm that the Docker downgrade seems to work. I reverted my machine to a snapshot taken fresh after my own initial investigation and before @mpalmer had done anything, so nothing that might have fixed the problem should have been applied at that point. I then stopped the app with ./launcher stop app, downgraded with apt-get install docker-ce=17.10.0~ce-0~ubuntu, restarted the Docker service with service docker restart, and, to be safe, rebuilt my Discourse instance with ./launcher rebuild app.

So far it’s been up and running for about an hour with no hitches, although there was an initial burst of heavy CPU usage that caused Discourse to chug while processing requests. I’m not sure what it was, but I figured it was the result of Discourse having been offline for more than a day (due to the snapshot timing). In any case, it subsided after about 10 or 15 minutes and all has been well since!

Big thanks @mpalmer for taking the time to look at our machine and find the problem! There’s no way we’d have a working forum at the moment without your help. :slight_smile:

7 Likes

Wow that is some incredible detective work, well done Matt. Here’s hoping Docker fixes this ASAP.

7 Likes

I do love me a good puzzle.

15 Likes

I assume the backup issue was related to this bug.

And we were this close (meaning we totally did) to accusing you of this :sob:

2 Likes

The out-of-memory error during scheduled backups was fixed separately last week.

3 Likes

Yes, a standalone installation (Redis, PgSQL, and the app all in one container) seizing up at the end of a backup could most definitely be caused by this Docker bug, because a completed backup can write extremely long lines to the container’s stdout, which is what triggers the bug.

If your system was running close to the limit of available memory in normal operation, then the problem may have been the backup process taking up more memory than it needed to, which was fixed by @sam last week. You can confirm that by looking for records of OOM kills in the system dmesg.
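
If it helps, a quick generic check for that is something like:

    # Look for kernel OOM-killer messages; -T prints human-readable timestamps
    # where supported.
    sudo dmesg -T | grep -Ei 'out of memory|oom-killer|killed process'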

8 Likes