Discourse web interface becomes unresponsive a few minutes after starting

Hoookay… after @Tsirist very kindly let me dig around, I’ve been able to narrow this down to a Docker bug of some sort. I’m having a great run of those lately.

tl;dr: if you are seeing Redis::TimeoutError in your Unicorn logs, try restarting Docker (service docker restart). That appears to make the problem go away, at least for a while.

The rest of this post will go into a lot of detail about what’s going on. You can safely ignore it if you’re not a fan of the deep minutiae of Unix-like systems. Mostly I’m writing all this down because otherwise I’ll forget it.

Under normal operation, Docker captures everything that gets written to the stdout/stderr of the processes running in a container, and logs it. This is the log you see when you run ./launcher logs app (or, if you interact directly with Docker, docker logs app). The log data is captured by pointing the container’s “default” stdout/stderr at pipes (or a pseudo-terminal, if you start the container with docker run -t), and having something at the other end of the pipe (or pty) read that data and write it to disk. In Docker, that something is a process called docker-containerd-shim.
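If you want a feel for that arrangement, here’s a rough Python sketch (purely illustrative – the real shim is a Go program inside Docker, nothing like this): the supervising process hands the child a pipe as its stdout/stderr, sits on the read end, and writes whatever comes out to its log.

```python
# Illustrative only: a "shim-like" parent capturing a child's output via a pipe.
import subprocess

child = subprocess.Popen(
    ["sh", "-c", "echo hello from the container"],
    stdout=subprocess.PIPE,        # child's stdout is the write end of a pipe
    stderr=subprocess.STDOUT,      # fold stderr into the same pipe
)
for line in child.stdout:          # the "shim" drains the read end...
    print("log:", line.decode().rstrip())  # ...and writes it to the log
child.wait()
```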

The problem is that this process appears to sometimes stop reading data from the pipe (or pty). That causes the buffer in the pipe/pty to fill up, and when a pipe’s buffer fills up, attempts to write to the pipe block (or fail, if O_NONBLOCK is set, which it never is, because nobody expects a write to stdout/stderr to block). Because that write blocks, the entire thread of execution seizes up, and that’s pretty much the end of that.
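You can see that kernel-level behaviour in isolation with a small Python demo (this has nothing to do with Docker itself): fill a pipe that nobody is reading, and the next write either blocks forever or, with O_NONBLOCK set, fails on the spot.

```python
# Demo: a pipe has a fixed buffer; once it's full and nothing reads the
# other end, further writes block (or fail immediately with O_NONBLOCK).
import os

r, w = os.pipe()            # nothing will ever read from r
os.set_blocking(w, False)   # set O_NONBLOCK so the demo errors instead of hanging

written = 0
try:
    while True:
        written += os.write(w, b"x" * 4096)
except BlockingIOError:
    # On Linux the default pipe capacity is 64 KiB, so this reports ~65536 bytes.
    print(f"pipe full after {written} bytes; a blocking write would now hang forever")
```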

That’s why attempts to connect to Redis eventually time out: Redis tries to write to stdout, that write blocks, and Redis is now completely wedged – nothing is doing anything any more. So when new connections are attempted, they pile up in the accept(2) backlog queue, and the connect(2) attempts eventually time out because the connections are never actually accepted.
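Again purely as an illustration (this is not the Discourse or Redis code path, and the exact numbers depend on your OS), here’s what a listener that never calls accept(2) looks like from the client side: connections succeed while there’s room in the backlog, then new attempts hang and eventually time out.

```python
# Demo: a server that listens but never accepts, like the wedged Redis above.
import socket

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)                      # tiny accept(2) backlog, and we never call accept()
host, port = server.getsockname()

clients = []
try:
    for i in range(16):
        c = socket.create_connection((host, port), timeout=2)
        clients.append(c)
        print(f"connection {i} established (queued in the backlog)")
except socket.timeout:
    print(f"connection {len(clients)} timed out: backlog full, nobody accepting")
```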

Incidentally, the PgSQL connections are seizing up too (PostgreSQL blocks while trying to write query logs), but the app happens to try to connect to Redis first, so that’s where the error shows up.

I haven’t been able to figure out exactly why the shim process stops reading. I restarted Docker to test a theory, and the problem stopped happening. My surmise, given that restarting just the container doesn’t fix the problem, is that the core problem is actually in the Docker daemon itself, and that it just causes some sort of brainfart in the shim. If I get an opportunity to examine another misbehaving system, I’ll certainly look into the problem further.

For now, though, the way to clear the problem appears to be to restart Docker. At least that turns the problem from “site runs for five minutes, then falls over” into “site runs for at least a couple of hours”, which is an improvement… I guess…
