Discourse web interface becomes unresponsive a few minutes after starting

Nah, THP isn’t the problem here.

Was that ps output taken while the problem was occurring? I see redis is running there, and not really doing anything.

At this point, all my remaining diagnostic suggestions are fairly involved. How are you with strace, lsof, and tcpdump? They’re the tools you’ll need to see what’s going on. Alternately, if you’re open to letting me onto the machine, I can do the investigation myself.
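
For the record, the sort of thing I have in mind looks roughly like this (the PIDs are placeholders – substitute ones from your own ps output):

    # see which syscall a wedged process is stuck in
    strace -f -p <unicorn-worker-pid>
    # list the sockets, pipes and files that process has open
    lsof -p <unicorn-worker-pid>
    # watch traffic on the redis port while a request hangs (run it where the
    # traffic actually flows – inside the container for an all-in-one install)
    tcpdump -i any -n port 6379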

1 Like

I believe the output of ps was taken while the problem was occurring. I meant it more in the sense of “before requests stop being responded to” versus “after”. Just in case, here’s another ps -aux. The container has been running for about an hour and is not responding to requests, so what you see here reflects the system state while the problem is occurring.

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  37912  3172 ?        Ss   08:11   0:05 /sbin/init
root         2  0.0  0.0      0     0 ?        S    08:11   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    08:11   0:02 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S<   08:11   0:00 [kworker/0:0H]
root         7  0.0  0.0      0     0 ?        S    08:11   0:19 [rcu_sched]
root         8  0.0  0.0      0     0 ?        S    08:11   0:00 [rcu_bh]
root         9  0.0  0.0      0     0 ?        S    08:11   0:00 [migration/0]
root        10  0.0  0.0      0     0 ?        S    08:11   0:00 [watchdog/0]
root        11  0.0  0.0      0     0 ?        S    08:11   0:00 [watchdog/1]
root        12  0.0  0.0      0     0 ?        S    08:11   0:00 [migration/1]
root        13  0.0  0.0      0     0 ?        S    08:11   0:02 [ksoftirqd/1]
root        15  0.0  0.0      0     0 ?        S<   08:11   0:00 [kworker/1:0H]
root        16  0.0  0.0      0     0 ?        S    08:11   0:00 [kdevtmpfs]
root        17  0.0  0.0      0     0 ?        S<   08:11   0:00 [netns]
root        18  0.0  0.0      0     0 ?        S<   08:11   0:00 [perf]
root        19  0.0  0.0      0     0 ?        S    08:11   0:00 [khungtaskd]
root        20  0.0  0.0      0     0 ?        S<   08:11   0:00 [writeback]
root        21  0.0  0.0      0     0 ?        SN   08:11   0:00 [ksmd]
root        22  0.0  0.0      0     0 ?        SN   08:11   0:04 [khugepaged]
root        23  0.0  0.0      0     0 ?        S<   08:11   0:00 [crypto]
root        24  0.0  0.0      0     0 ?        S<   08:11   0:00 [kintegrityd]
root        25  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        26  0.0  0.0      0     0 ?        S<   08:11   0:00 [kblockd]
root        27  0.0  0.0      0     0 ?        S<   08:11   0:00 [ata_sff]
root        28  0.0  0.0      0     0 ?        S<   08:11   0:00 [md]
root        29  0.0  0.0      0     0 ?        S<   08:11   0:00 [devfreq_wq]
root        33  0.1  0.0      0     0 ?        S    08:11   0:43 [kswapd0]
root        34  0.0  0.0      0     0 ?        S<   08:11   0:00 [vmstat]
root        35  0.0  0.0      0     0 ?        S    08:11   0:00 [fsnotify_mark]
root        36  0.0  0.0      0     0 ?        S    08:11   0:00 [ecryptfs-kthrea]
root        52  0.0  0.0      0     0 ?        S<   08:11   0:00 [kthrotld]
root        53  0.0  0.0      0     0 ?        S<   08:11   0:00 [acpi_thermal_pm]
root        54  0.0  0.0      0     0 ?        S    08:11   0:00 [vballoon]
root        55  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        56  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        57  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        58  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        59  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        60  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        61  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        62  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        63  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root        64  0.0  0.0      0     0 ?        S    08:11   0:00 [scsi_eh_0]
root        65  0.0  0.0      0     0 ?        S<   08:11   0:00 [scsi_tmf_0]
root        66  0.0  0.0      0     0 ?        S    08:11   0:00 [scsi_eh_1]
root        67  0.0  0.0      0     0 ?        S<   08:11   0:00 [scsi_tmf_1]
root        73  0.0  0.0      0     0 ?        S<   08:11   0:00 [ipv6_addrconf]
root        86  0.0  0.0      0     0 ?        S<   08:11   0:00 [deferwq]
root        87  0.0  0.0      0     0 ?        S<   08:11   0:00 [charger_manager]
root       124  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       125  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       126  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       127  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       128  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       129  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       130  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       131  0.0  0.0      0     0 ?        S<   08:11   0:00 [bioset]
root       132  0.0  0.0      0     0 ?        S    08:11   0:00 [scsi_eh_2]
root       133  0.0  0.0      0     0 ?        S<   08:11   0:00 [kpsmoused]
root       134  0.0  0.0      0     0 ?        S<   08:11   0:00 [scsi_tmf_2]
root       135  0.0  0.0      0     0 ?        S<   08:11   0:00 [ttm_swap]
root       136  0.0  0.0      0     0 ?        S<   08:11   0:00 [qxl_gc]
root       413  0.0  0.0      0     0 ?        S    08:11   0:02 [jbd2/vda1-8]
root       414  0.0  0.0      0     0 ?        S<   08:11   0:00 [ext4-rsv-conver]
root       455  0.0  0.0      0     0 ?        S<   08:11   0:00 [kworker/0:1H]
root       456  0.0  0.0  29648  1824 ?        Ss   08:11   0:07 /lib/systemd/systemd-journald
root       479  0.0  0.0      0     0 ?        S    08:11   0:00 [kauditd]
root       495  0.0  0.0  44432  1468 ?        Ss   08:11   0:00 /lib/systemd/systemd-udevd
root       532  0.0  0.0      0     0 ?        S<   08:11   0:00 [kworker/1:1H]
systemd+   627  0.0  0.0 100324   140 ?        Ssl  08:11   0:00 /lib/systemd/systemd-timesyncd
root       750  0.0  0.0      0     0 ?        S<   08:11   0:00 [kvm-irqfd-clean]
root       991  0.0  0.0   4400    80 ?        Ss   08:12   0:00 /usr/sbin/acpid
root       997  0.0  0.0  65520  1032 ?        Ss   08:12   0:02 /usr/sbin/sshd -D
root      1004  0.0  0.0  29880   220 ?        Ss   08:12   0:00 /sbin/cgmanager -m name=systemd
syslog    1008  0.0  0.0 256396   664 ?        Ssl  08:12   0:02 /usr/sbin/rsyslogd -n
root      1012  0.0  0.0 275872   960 ?        Ssl  08:12   0:03 /usr/lib/accountsservice/accounts-daemon
root      1015  0.0  0.0  29008   512 ?        Ss   08:12   0:00 /usr/sbin/cron -f
message+  1020  0.0  0.0  42900   476 ?        Ss   08:12   0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
daemon    1030  0.0  0.0  26044   196 ?        Ss   08:12   0:00 /usr/sbin/atd -f
root      1031  0.0  0.0  28620   628 ?        Ss   08:12   0:00 /lib/systemd/systemd-logind
root      1038  0.4  1.7 530876 36640 ?        Ssl  08:12   2:51 /usr/bin/dockerd -H fd://
root      1068  0.0  0.0  19472   104 ?        Ss   08:12   0:01 /usr/sbin/irqbalance --pid=/var/run/irqbalance.pid
root      1075  0.0  0.0 277088   696 ?        Ssl  08:12   0:00 /usr/lib/policykit-1/polkitd --no-debug
root      1091  0.0  0.0  15940   144 tty1     Ss+  08:12   0:00 /sbin/agetty --noclear tty1 linux
root      1095  0.1  0.7 383184 15316 ?        Ssl  08:12   1:13 docker-containerd --config /var/run/docker/containerd/containerd.toml
root      3762  0.0  0.0      0     0 ?        S    17:44   0:00 [kworker/1:1]
root      4026  0.0  0.0 116408   108 ?        Sl   17:44   0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 443 -container-ip 172.17.0.2 -container-port 44
root      4038  0.0  0.0 118072   424 ?        Sl   17:44   0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 80 -container-ip 172.17.0.2 -container-port 80
root      4044  0.0  0.0 363444  1940 ?        Sl   17:44   0:00 docker-containerd-shim --namespace moby --workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1
root      4060  0.0  0.0  21288     0 pts/0    Ss+  17:44   0:00 /bin/bash /sbin/boot
root      4357  0.0  0.0   4396    28 pts/0    S+   17:44   0:00 /usr/bin/runsvdir -P /etc/service
root      4358  0.0  0.0   4244     4 ?        Ss   17:44   0:00 runsv cron
root      4359  0.0  0.0   4244     0 ?        Ss   17:44   0:00 runsv nginx
root      4360  0.0  0.0   4244     4 ?        Ss   17:44   0:00 runsv postgres
root      4361  0.0  0.0  82724   888 ?        S    17:44   0:00 nginx: master process /usr/sbin/nginx
root      4362  0.0  0.0  29300  1108 ?        S    17:44   0:00 cron -f
root      4363  0.0  0.0   4244     0 ?        Ss   17:44   0:00 runsv unicorn
root      4364  0.0  0.0   4244    16 ?        Ss   17:44   0:00 runsv redis
root      4365  0.0  0.0   4244     0 ?        Ss   17:44   0:00 runsv rsyslog
systemd+  4366  0.0  0.5 295044 12132 ?        S    17:44   0:00 /usr/lib/postgresql/9.5/bin/postmaster -D /etc/postgresql/9.5/main
tmpname   4367  0.0  0.1  29900  3756 ?        S    17:44   0:02 /bin/bash config/unicorn_launcher -E production -c config/unicorn.conf.rb
systemd+  4368  0.0  0.0 182664   164 ?        Sl   17:44   0:00 rsyslogd -n
_apt      4369  0.1 10.8 261612 222176 ?       Sl   17:44   0:04 /usr/bin/redis-server *:6379
www-data  4380  0.0  0.2  84376  4380 ?        S    17:44   0:02 nginx: worker process
www-data  4381  0.0  0.2  84332  5708 ?        S    17:44   0:01 nginx: worker process
www-data  4382  0.0  0.0  82892   608 ?        S    17:44   0:00 nginx: cache manager process
systemd+  4385  0.0  0.2 295156  4684 ?        Ss   17:44   0:00 postgres: 9.5/main: checkpointer process
systemd+  4386  0.0  3.1 295044 64104 ?        Ss   17:44   0:00 postgres: 9.5/main: writer process
systemd+  4387  0.0  0.2 295044  4920 ?        Ss   17:44   0:00 postgres: 9.5/main: wal writer process
systemd+  4388  0.0  0.1 295472  3120 ?        Ss   17:44   0:00 postgres: 9.5/main: autovacuum launcher process
systemd+  4389  0.0  0.0 150348   820 ?        Ss   17:44   0:00 postgres: 9.5/main: stats collector process
tmpname   4392  0.7  7.0 472664 143456 ?       Sl   17:44   0:25 unicorn master -E production -c config/unicorn.conf.rb
tmpname   4495  1.7 18.8 1830520 385064 ?      Sl   17:44   0:57 sidekiq 5.0.5 discourse [0 of 5 busy]
systemd+  4800  0.4  8.4 342592 173608 ?       Ss   17:45   0:13 postgres: 9.5/main: discourse discourse [local] SELECT
root      4809  0.0  0.0      0     0 ?        S    17:45   0:00 [kworker/u4:1]
systemd+  5208  0.7  7.0 307028 143704 ?       Ss   17:46   0:24 postgres: 9.5/main: discourse discourse [local] SELECT
systemd+  5682  0.0  7.3 313932 149524 ?       Ss   17:47   0:01 postgres: 9.5/main: discourse discourse [local] SELECT
systemd+  5889  0.0  1.8 308660 38316 ?        Ss   17:48   0:00 postgres: 9.5/main: discourse discourse [local] SELECT
root      7792  0.0  0.0      0     0 ?        S    18:12   0:00 [kworker/0:1]
root      7863  0.0  0.0      0     0 ?        S    18:12   0:00 [kworker/1:2]
root      7923  0.0  0.0      0     0 ?        S    18:12   0:00 [kworker/u4:0]
tmpname   9855  1.0 11.3 1628044 232600 ?      Sl   18:33   0:03 unicorn worker[0] -E production -c config/unicorn.conf.rb
tmpname  10343 15.0 11.2 1627016 230736 ?      Sl   18:39   0:02 unicorn worker[1] -E production -c config/unicorn.conf.rb
systemd+ 10362  0.0  0.7 302932 15832 ?        Ss   18:39   0:00 postgres: 9.5/main: discourse discourse [local] idle
tmpname  10389  1.0  0.1  20148  3560 ?        S    18:39   0:00 sleep 1
root     10390  0.0  0.1  37364  3300 pts/0    R+   18:39   0:00 ps -aux
root     20792  0.0  0.0  99512   840 ?        Ss   15:35   0:00 sshd: tmpname [priv]
tmpname  20812  0.0  0.1  45248  2208 ?        Ss   15:36   0:00 /lib/systemd/systemd --user
tmpname  20816  0.0  0.0  63424   152 ?        S    15:36   0:00 (sd-pam)
tmpname  20924  0.0  0.1 100296  2052 ?        S    15:36   0:01 sshd: tmpname@pts/0
tmpname  20925  0.0  0.0  24324   736 pts/0    Ss   15:36   0:00 -bash
root     20957  0.0  0.0  56836   748 pts/0    S    15:36   0:00 sudo bash
root     20958  0.0  0.1  24284  3024 pts/0    S    15:36   0:00 bash
root     26276  0.0  0.0      0     0 ?        S    16:00   0:00 [kworker/0:0]

I know of strace and its purpose but haven’t had a lot of practice with it. None at all with lsof or tcpdump. I’d be fine letting you poke around on the machine. Let me take a snapshot of the machine first so that if anything goes awry, you won’t have to worry about it being a big deal. :slight_smile: It’s a for-fun community forum with no financial or other major concerns about the Discourse machine being lost, and downtime isn’t a major issue, but just in case I’ll back things up starting now unless you think it’s absolutely not necessary.

PM me and we can arrange access.

Remove anything here that does not begin with https://github.com/discourse – we also have an official MathJax plugin, as I recall? Yes, it’s at https://github.com/discourse/discourse-math

1 Like

The no-plugins dance has already been tried.

Also, timing out whilst connecting to Redis really shouldn’t be possible to trigger via a plugin. That’d be some powerful magicks. Then again, timing out whilst trying to connect to localhost shouldn’t be possible at all…

2 Likes

If the time comes I’ll give it a go, as I’m not sure whether I ever tried ONLY official Discourse plugins, but I did try no plugins and a few other arrangements. I probably won’t be touching it once I hand it over to Matt, though.

Thanks for pointing out the MathJax plugin. Will definitely make use of that in the future!

Hoookay… after @Tsirist very kindly let me dig around, I’ve been able to narrow this down to a Docker bug of some sort. I’m having a great run of those lately.

tl;dr: if you are seeing Redis::TimeoutError in your Unicorn logs, try restarting Docker (service docker restart). That appears to make the problem go away, at least for a while.
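
In shell terms, that’s roughly the following (the log path assumes a standard /var/discourse install – adjust to taste):

    # check the unicorn logs for the tell-tale error
    grep -c 'Redis::TimeoutError' /var/discourse/shared/standalone/log/rails/unicorn.stderr.log
    # restart the Docker daemon itself – restarting just the container isn't enough
    service docker restart
    # bring the app container back up if it doesn't come back on its own
    cd /var/discourse && ./launcher start app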

The rest of this post will go into a lot of detail about what’s going on. You can safely ignore it if you’re not a fan of the deep minutiae of Unix-like systems. Mostly I’m writing all this down because otherwise I’ll forget it.

Under normal operation, Docker containers capture everything that gets written to the stdout/stderr of the processes running in the container, and log it. This is the log you see when you run ./launcher logs app (or, if you interact directly with Docker, docker logs app). This log data is captured by pointing the container’s “default” stdout/stderr at pipes (or a pseudo-terminal, if you start the container with docker run -t), and having something at the other end of the pipe (or pty) read that data and write it to disk. In Docker, that something is a process called docker-containerd-shim.
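
If you want to see that plumbing for yourself, something along these lines works from the host (the PIDs are the ones from the ps output earlier in this topic – yours will differ):

    # the shim that should be draining the container's output
    ps -ef | grep docker-containerd-shim
    # redis's stdout (fd 1) is one end of that pipe (or a pty, depending on
    # how the container was started)
    ls -l /proc/4369/fd/1
    # the shim holds the other end of the same pipe
    lsof -p 4044 | grep -i pipe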

The problem, it appears, is that this process sometimes stops reading data from the pipe (or pty). This causes the buffer in the pipe/pty to fill up, and when a pipe’s buffer fills up, attempts to write to the pipe block (or fail, if O_NONBLOCK is set, which it never is here, because nobody expects a write to stdout/stderr to block). Because this write blocks, the entire thread of execution seizes up, and that’s pretty much the end of that.
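
You can demonstrate the underlying behaviour without Docker being involved at all; here’s a minimal sketch of the same failure mode:

    # 'sleep' never reads its stdin, so once the kernel's pipe buffer
    # (typically 64 KiB) is full, dd blocks in write(2) and goes nowhere –
    # exactly what happens to a process whose shim stops reading
    dd if=/dev/zero bs=1k count=1000 | sleep 300 &
    sleep 2
    ps -o pid,stat,wchan,cmd -C dd    # dd is now asleep, stuck in a pipe write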

That’s why attempts to connect to Redis eventually time out – Redis tries to write to stdout, that write blocks, and Redis is now completely wedged – nothing’s doing anything any more. So, when new connections are attempted, they pile up in the accept(2) backlog queue, and the connect(2) attempt eventually times out because the connection is never actually accepted.
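
On an affected machine you can watch exactly that happening (4369 being the redis-server PID in the ps output above):

    # attaching shows redis stuck in a write() to fd 1 – its stdout – that
    # never completes, so it never gets back to accepting new connections
    strace -p 4369 -e trace=write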

Incidentally, PgSQL connections are also seizing up (trying to write query logs), but for whatever reason the app’s trying to connect to Redis first, so that’s what causes the error.

I haven’t been able to figure out exactly why the shim process stops reading. I restarted Docker to test a theory, and the problem stopped happening. My surmise, given that restarting just the container doesn’t fix the problem, is that the core problem is actually in the Docker daemon itself, and that it in turn causes some sort of brainfart in the shim. If I get an opportunity to examine another misbehaving system, I’ll certainly look into the problem further.

For now, though, the way to clear the problem appears to be to restart Docker. At least that turns the problem from “site runs for five minutes, then falls over” into “site runs for at least a couple of hours”, which is an improvement… I guess…

17 Likes

So it’s not gremlins? This is something of a relief.

And you think this explains the bootstrap problem too, and the solution is to just reboot? (Rebooting is overkill when restarting Docker would do, but it’d work.) I’ve been working with computers for nearly 40 years. The first question is always “is it plugged in?” and the second is “did you reboot?” I’m back to feeling foolish. (At least rebooting doesn’t actually fix the problem.)

And I thought I was so clever just restarting the data container.

Have you seen any reports of this bug elsewhere?

You’re welcome to have a look at my system again. It went down a couple of times yesterday. I’m pretty sure it’ll die again if you bootstrap multi

2 Likes

It’s probably best to keep the discussion of the issues you’ve been seeing over in the other thread, because they’re more varied, and I can’t account for all of them with this bug. Also, I don’t know whether rebooting the machine will make the problem better or worse – I’ve only tried restarting the Docker service, which is a very different beast to a full reboot.

4 Likes

Ladies, Gentlemen, and Small Furry Creatures From Alpha Centauri, I present to you: the Docker bug that has been causing all the angst. I got a reliable repro, and verified that only docker-ce 17.11 is impacted. Hopefully, Docker will fix the bug sooner or later, but until then, the recommended workaround is to downgrade Docker to a working version, as follows (a condensed command-line version follows the numbered steps):

  1. Stop and delete all running containers. In addition to this bug, Docker also changed some other things in the way that containers are run, and a downgrade while containers are still running will end in tears. If you’re only running Discourse on your server, you can just stop and delete the container with docker rm -f app (your data is safe, and won’t be deleted by this command). If you’re running other containers on the machine as well, you’ll have to figure out what to do.

  2. Downgrade Docker. Via the magic of the package manager, apt-get install docker-ce=17.10.0~ce-0~ubuntu will do the trick. You’ll have to say y to the installation, because it’s a downgrade.

  3. (optional) Make sure Docker doesn’t get automatically upgraded again. That’d really ruin your day, because not only would you have a buggy Docker behind the wheel again, but due to the aforementioned changes in how containers are run, an automated upgrade would likely leave your containers in a kind of limbo state. Not cool.

    To make sure you stay on a known-good version, create a file /etc/apt/preferences.d/docker.pref with the following contents:

     Package: docker-ce
     Pin: version 17.10*
     Pin-Priority: 9001
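
Condensed into commands, the whole dance looks roughly like this (assuming Discourse is the only thing on the box and a standard /var/discourse install):

    # 1. stop and delete the running container (your data volumes are untouched)
    cd /var/discourse
    docker rm -f app

    # 2. downgrade to the last known-good release
    apt-get install docker-ce=17.10.0~ce-0~ubuntu

    # 3. pin that version so an automatic upgrade can't reintroduce the bug
    printf 'Package: docker-ce\nPin: version 17.10*\nPin-Priority: 9001\n' > /etc/apt/preferences.d/docker.pref

    # then bring Discourse back up
    ./launcher rebuild app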

23 Likes

I can confirm that the Docker downgrade seems to work. I reverted my machine to a snapshot taken right after my own initial investigation and before @mpalmer had done anything, so no potentially helpful changes were in place at that point. I then stopped the app with ./launcher stop app, downgraded with apt-get install docker-ce=17.10.0~ce-0~ubuntu, restarted the Docker service with service docker restart, and rebuilt my Discourse instance to be safe with ./launcher rebuild app.

So far it’s been up and running for about an hour with no hitches, although there was an initial burst of heavy CPU usage that caused Discourse to chug while processing requests. I’m not sure what it was, but I figured it was the result of Discourse having been offline for more than a day (due to the snapshot timing). In any case, that subsided after about 10 or 15 minutes and all has been well since!

Big thanks @mpalmer for taking the time to look at our machine and find the problem! There’s no way we’d have a working forum at the moment without your help. :slight_smile:

7 Likes

Wow that is some incredible detective work, well done Matt. Here’s hoping Docker fixes this ASAP.

7 Likes

I do love me a good puzzle.

15 Likes

I assume the backup issue was related to this bug.

And we were this close (meaning we totally did) to accusing you of this :sob:

2 Likes

The out-of-memory error during scheduled backups was fixed separately last week.

3 Likes

Yes, a standalone (Redis, PgSQL, and app all in one container) installation seizing up at the end of a backup could most definitely be caused by this Docker bug, because a completed backup has the definite potential to write extremely long lines to the container’s stdout, which is what triggers the Docker bug.

If your system was running closeish to the available memory in normal operation, then the problem may have been the backup process taking up more memory than it needed to, as was fixed by @sam last week. You can confirm that by looking for records of OOM killing in the system dmesg.
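
Something along these lines will tell you whether the OOM killer got involved:

    # any OOM-killer activity shows up in the kernel log
    dmesg | grep -iE 'out of memory|killed process'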

8 Likes

@sam has deployed a workaround in Discourse for this problem; if you rebuild with the latest discourse_docker changes, redis and pg logs should go to files, rather than Docker, and the bug shouldn’t(!) be triggered.
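
For a standard install, picking up that change is just:

    # pull the latest discourse_docker changes and rebuild the container
    cd /var/discourse
    git pull
    ./launcher rebuild app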

8 Likes

Again fantastic work @mpalmer and @sam. Top notch open source citizenry.

7 Likes

Once Docker releases a good version, should this file be removed?

2 Likes

Yep, once the bug is fixed the pin should be removed.

6 Likes

Docker is now on version 17.12.0-ce – maybe this is fixed?