Discourse web interface becomes unresponsive a few minutes after starting

The no-plugins dance has already been tried.

Also, timing out whilst connecting to Redis really shouldn’t be possible to trigger via a plugin. That’d be some powerful magicks. Then again, timing out whilst trying to connect to localhost shouldn’t be possible at all…

2 Likes

If the time comes I’ll give it a go, as I’m not sure whether I tried ONLY official Discourse plugins, but I did try no plugins and a few other arrangements. I probably won’t be touching it once I hand it over to Matt, though.

Thanks for pointing out the MathJax plugin. Will definitely make use of that in the future!

Hoookay… after @Tsirist very kindly let me dig around, I’ve been able to narrow this down to a Docker bug of some sort. I’m having a great run of those lately.

tl;dr: if you are seeing Redis::TimeoutError in your Unicorn logs, try restarting Docker (service docker restart). That appears to make the problem go away, at least for a while.

The rest of this post will go into a lot of detail about what’s going on. You can safely ignore it if you’re not a fan of the deep minutiae of Unix-like systems. Mostly I’m writing all this down because otherwise I’ll forget it.

Under normal operation, Docker containers capture everything that gets written to the stdout/stderr of the processes running in the container, and log them. This is the log that gets written when you run ./launcher logs app (or, if you interact directly with Docker, docker logs app). This log data is captured by pointing the container’s “default” stdout/stderr to pipes (or a pseudo-terminal, if you start the container with docker run -t), and having something at the other end of the pipe (or pty) read that data and write it to disk. In Docker, that is a process called docker-containerd-shim.
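
If you want to see that plumbing for yourself, here’s a rough sketch (my own illustration, not anything from the Discourse tooling) that assumes the standard container name app and is run as root on the host. It shows what the container’s init process has attached to its stdout:

    import os
    import subprocess

    # Ask Docker for the host-side PID of the container's init process
    # (assumes the standard Discourse container name "app"; run as root).
    pid = subprocess.check_output(
        ["docker", "inspect", "--format", "{{.State.Pid}}", "app"],
        text=True,
    ).strip()

    # fd 1 is stdout; this typically resolves to something like "pipe:[1234567]",
    # i.e. a pipe whose other end is read by docker-containerd-shim
    # (or a pty device, if the container was started with -t).
    print(os.readlink(f"/proc/{pid}/fd/1"))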

The problem appears to be that this process sometimes stops reading data from the pipe (or pty). The buffer in the pipe/pty then fills up, and once a pipe’s buffer is full, attempts to write to it block (or fail, if O_NONBLOCK is set, which it never is here, because nobody expects a write to stdout/stderr to block). Because that write blocks, the entire thread of execution seizes up, and that’s pretty much the end of that.
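
To see that behaviour in isolation, here’s a minimal sketch of the underlying pipe mechanics (nothing Docker- or Discourse-specific about it): on Linux a pipe holds roughly 64 KiB, and once nobody drains it, further writes block, or fail with EAGAIN if the writer happens to be non-blocking:

    import os

    # Create a pipe and never read from the read end; this stands in for
    # docker-containerd-shim going to sleep on the job.
    r, w = os.pipe()

    # Make the write end non-blocking purely so the demo terminates; with the
    # default blocking mode (which is what stdout/stderr get), the write would
    # simply hang, just like Redis/PgSQL inside the wedged container.
    os.set_blocking(w, False)

    written = 0
    try:
        while True:
            written += os.write(w, b"x" * 4096)
    except BlockingIOError:
        print(f"pipe buffer filled after {written} bytes; further writes would block")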

That’s why attempts to connect to Redis eventually time out: Redis tries to write to stdout, that write blocks, and Redis is now completely wedged; nothing is doing anything any more. So when new connections are attempted, they pile up in the accept(2) backlog queue, and the connect(2) attempt eventually times out because the connection is never actually accepted.
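
The connection side of the story can be sketched the same way: a listener that never calls accept(2) lets a couple of connections sit in the kernel’s backlog, and once that fills up, further connect(2) attempts just hang until the client-side timeout fires. This is only an illustration, and the exact counts assume Linux defaults:

    import socket

    # A server socket that listens but never accepts; this stands in for a
    # Redis that is blocked on a write to stdout and no longer running its
    # event loop.
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)                       # deliberately tiny backlog
    port = srv.getsockname()[1]

    held = []
    for i in range(10):
        c = socket.socket()
        c.settimeout(2)                 # short client-side connect timeout
        try:
            c.connect(("127.0.0.1", port))
            held.append(c)              # these landed in the backlog queue
            print(f"connection {i}: queued in the backlog")
        except OSError as e:            # typically a timeout once the backlog is full
            print(f"connection {i}: {e!r} (nobody is accepting)")
            break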

Incidentally, PgSQL connections are also seizing up (trying to write query logs), but for whatever reason the app’s trying to connect to Redis first, so that’s what causes the error.

I haven’t been able to figure out exactly why the shim process stops reading. I restarted Docker to test a theory, and the problem stopped happening. My surmise, given that restarting just the container doesn’t fix the problem, is that the core problem is actually in the Docker daemon itself, and that it just causes some sort of brainfart in the shim. If I get an opportunity to examine another misbehaving system, I’ll certainly look into the problem further.

For now, though, the way to clear the problem appears to be to restart Docker. At least that turns the problem from “site runs for five minutes, then falls over” into “site runs for at least a couple of hours”, which is an improvement… I guess…

17 Likes

So it’s not gremlins? This is something of a relief.

And you think this explains the bootstrap problem too, and the solution is to just reboot? (Overkill when restarting Docker would do, but it would work.) I’ve been working with computers for nearly 40 years. The first question is always “is it plugged in?” and the second is “did you reboot?” I’m back to feeling foolish. (At least rebooting doesn’t actually fix the problem.)

And I thought I was so clever just restarting the data container.

Have you seen any reports of this bug elsewhere?

You’re welcome to have a look at my system again. It went down a couple of times yesterday. I’m pretty sure it’ll die again if you bootstrap multi.

2 Likes

It’s probably best to keep the discussion of the issues you’ve been seeing over in the other thread, because they’re more varied, and I can’t account for all of them with this bug. Also, I don’t know whether rebooting the machine will make the problem better or worse – I’ve only tried a service restart, which is a very different beast from a reboot as far as Docker is concerned.

4 Likes

Ladies, Gentlemen, and Small Furry Creatures From Alpha Centauri, I present to you: the Docker bug that has been causing all the angst. I got a reliable repro, and verified that only docker-ce 17.11 is impacted. Hopefully, Docker will fix the bug sooner or later, but until then, the recommended workaround is to downgrade Docker to a working version, as follows:

  1. Stop and delete all running containers. In addition to this bug, Docker also changed some other things in the way that containers are run, and a downgrade while containers are still running will end in tears. If you’re only running Discourse on your server, you can just stop and delete the container with docker rm -f app (your data is safe, and won’t be deleted by this command). If you’re running other containers on the machine as well, you’ll have to figure out what to do.

  2. Downgrade Docker. Via the magic of the package manager, apt-get install docker-ce=17.10.0~ce-0~ubuntu will do the trick. You’ll have to say y to the installation, because it’s a downgrade.

  3. (optional) Make sure Docker doesn’t get automatically upgraded again. That’d really ruin your day, because not only would you have a buggy Docker behind the wheel again, but due to the aforementioned changes in how containers are run, an automated upgrade would likely leave your containers in a kind of limbo state. Not cool.

    To make sure you stay on a known-good version, create a file /etc/apt/preferences.d/docker.pref with the following contents (a quick way to double-check the pin is noted just after this list):

     Package: docker-ce
     Pin: version 17.10*
     Pin-Priority: 9001
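
If you want to check your handiwork afterwards, apt-cache policy docker-ce should now show a 17.10 build as the candidate version, and docker --version should report 17.10 once the downgrade is done.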
23 Likes

I can confirm that the Docker downgrade seems to work. I reverted my machine to a snapshot taken fresh after my own initial investigation and before @mpalmer had done anything, so nothing potentially useful should have been lost at that point. I then stopped the app with ./launcher stop app, downgraded with apt-get install docker-ce=17.10.0~ce-0~ubuntu, restarted the Docker service with service docker restart, and rebuilt my Discourse instance, to be safe, with ./launcher rebuild app.

So far it’s been up and running for about an hour with no hitches, although there was initially a large amount of CPU usage that caused Discourse to chug while processing requests. I’m not sure what it was, but figured it was the result of Discourse having been offline for more than a day (due to the snapshot timing). In any case, that subsided after about 10 or 15 minutes and all has been well since!

Big thanks @mpalmer for taking the time to look at our machine and find the problem! There’s no way we’d have a working forum at the moment without your help. :slight_smile:

7 Likes

Wow that is some incredible detective work, well done Matt. Here’s hoping Docker fixes this ASAP.

7 Likes

I do love me a good puzzle.

15 Likes

I’m assuming the backup issue was related to this bug.

And we were this close (meaning we totally did) to accusing you of this :sob:

2 Likes

The out-of-memory issue during scheduled backups was fixed separately last week.

3 Likes

Yes, a standalone (Redis, PgSQL, and app all in one container) installation seizing up at the end of a backup could most definitely be caused by this Docker bug, because a completed backup has the definite potential to write extremely long lines to the container’s stdout, which is what triggers the Docker bug.

If your system was running closeish to the available memory in normal operation, then the problem may have been the backup process taking up more memory than it needed to, as was fixed by @sam last week. You can confirm that by looking for records of OOM killing in the system dmesg.
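
If you’re not sure what to look for, something along the lines of dmesg | grep -i 'out of memory' on the host will turn up the kernel’s OOM killer messages, if there were any.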

8 Likes

@sam has deployed a workaround in Discourse for this problem; if you rebuild with the latest discourse_docker changes, redis and pg logs should go to files, rather than Docker, and the bug shouldn’t(!) be triggered.

8 Likes

Again fantastic work @mpalmer and @sam. Top notch open source citizenry.

7 Likes

Once Docker releases a good version, should this file be removed?

2 Likes

Yep, once the bug is fixed the pin should be removed.

6 Likes

Docker is on version 17.12.0-ce now; maybe this is fixed?

Yep, looks like it might have been fixed accidentally. There’s no indication in the bug report that it was deliberately fixed.

6 Likes

We had that issue as well, and when freeing up some space didn’t fix it, I upgraded Docker to 17.12.0-ce. It worked for the next (almost) four days, and then hung again after the backup. It’s been doing that again for the last two days. See Neos Project if you’re interested.

Is this fixed for everybody else? Or do the issues continue for anyone?

I pinned several sites to 17.10 and upgraded another to 17.12 yesterday.

I’ve not had any problems since.

2 Likes