Runsv hanging on Docker container shutdown

(Andy Balholm) #1

This morning we did some routine upgrades on our server (you know, the ones Ubuntu always nags you to do). The packages that were upgraded included the kernel and Docker.

Now when we stop our Docker container with Discourse in it, it hangs in runsv. One or more runsv processes inside the container hang, consuming 100% CPU. top shows the CPU usage as “system” usage, so they are hung inside the kernel somewhere. Because of this, they can’t be killed even with SIGKILL; we need to reboot the server.

We’re running Ubuntu 14.04.3, 64-bit, with kernel version 3.13.0-73. The Docker version is 1.9.1.

(Andy Balholm) #2

Could the problem be that SIGTERM arrived during a syscall that returns ERESTART?

(Sam Saffron) #3

Is this reproducible?

(Andy Balholm) #4

Yes, although if we reproduce it we then need to reboot our server, which also hosts other things, so we don’t want to reproduce it too often.

(Sam Saffron) #5

Does docker stop cause it to hang as well? Can this be a docker bug?

(Andy Balholm) #6

Yes, it hangs with docker stop or service docker restart. docker kill works fine, as long as it isn’t hanging yet.

It could easily be a bug in Docker, or in runit.

(Sam Saffron) #7

Unlikely to be runit, because docker stop is meant to wait 10 seconds and then send a SIGKILL; stop is always meant to stop containers. Can you open a ticket with Docker on GitHub?

(Andy Balholm) #8

(marioMAN) #9

Hello guys

This keeps happening

./launcher stop forum0:

+ /usr/bin/docker stop -t 10 forum0

27879 root 20 0 168 4 0 R 100,0 0,0 8:28.82 runsv
27881 root 20 0 168 4 0 R 100,0 0,0 8:29.09 runsv

Ubuntu 14.04
Linux localhost 3.13.0-74-generic #118-Ubuntu SMP Thu Dec 17 22:52:10 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Docker version 1.9.1, build a34a1d5

I checked the GitHub thread but it seems to be going nowhere.
Any ideas? A workaround?


(Andy Balholm) #10

Once it’s happened, you need to reboot.

To avoid it happening, use docker kill instead of docker stop.

(marioMAN) #11

docker kill forum0

4676 root 20 0 168 4 0 R 99.8 0.0 5:59.47 runsv :frowning:

(Andy Balholm) #12

Did you try to kill it after you had already stopped it? Docker kill (and even kill -9) doesn’t work on containers that are already frozen like this.

(marioMAN) #13

No I restarted the machine.
I have a jenkins job that builds the forum each hour

	pushd /var/discourse
	#sudo ./launcher restart forum0
	sudo docker kill forum0 || true
	sleep 20
	sudo ./launcher start forum0

That’s the change I made to the script, but after 2 runs I hit the same issue.

(Luke) #14

I read over the GitHub issue; evidently it’s an issue with AUFS plus whatever Linux kernel Ubuntu is using? I was running my small gaming community on Debian boxes, which ran Discourse fine. I had decided to migrate over to Ubuntu, and noticed that when I went to rebuild my container this issue occurred. Hopefully this gets traction somewhere; it’s a pretty big issue.

I am also interested in a workaround. I’ve been a fan of Discourse since the beginning, and now I’m faced with the decision to either migrate back to Debian stable, move to other software (not fun), or figure out another solution.

That said, is this even an issue with Discourse, or purely with Docker itself?

(Jeff Atwood) #15

Never heard of this issue and we run Ubuntu 14.04 LTS exclusively across hundreds of Discourse installs, all in Docker containers. So I think your conclusion is a bit questionable…

(Luke) #16

When I first started running Discourse I also ran Ubuntu 14.04, exclusively, without issue up until about 2 months ago when I decided to switch to Debian.

After realizing that I prefer Ubuntu, mainly because I find it extremely predictable (for myself, I don’t want to get off-topic here), I decided to switch back.

I created a new Discourse installation following this guide, which is what I’ve always done:

I realized I had forgotten to add a plugin, so I went to rebuild with the new configuration. Upon execution of ./launcher rebuild app, the command hung. I opened a new session, ran htop, and was greeted with the same issue as @andybalholm.

I’m not sure where the issue lies. I’ll gladly provide you with more details if you just tell me what you need.

(Sam Saffron) #17

@tgxworld, is this the issue you and @mpalmer were seeing? Do we need to update our base image?

(Matt Palmer) #18

Ohhh, this one… fun times.

I tracked this one down for some internal images which were having problems. The issue, it turns out, is when you have a container PID 1 which doesn’t properly pass down signals to its children before exiting. Apparently, PID 1 is supposed to whack all its children – Docker won’t go chasing around the process namespace looking for things to whack, or something like that.

The solution is to make sure that, in response to a TERM signal, PID 1 terminates all its children and waits for them to exit before it terminates itself. It also needs to do this “recursively”: when it signals its children, they may not kill their own children properly, so those grandchildren get reparented to PID 1, which then needs to whack them too, and so on and so forth.

I believe that tini handles all this sort of thing properly, and I strongly recommend we integrate it into our base images and use it everywhere. However, at the very least, we should audit how the existing boot process works, and fix it so it catches and re-propagates signals to (for example) runsvdir (which, when it gets a TERM, correctly terminates all the runsv-mediated process trees under it).
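
As a rough illustration of the "catch and re-propagate" part, here is a minimal shell sketch. It is not what the Discourse base image does, and run_as_init is a made-up name for this example. It assumes the wrapper is started as the container's PID 1 with the real command passed as arguments, and it only covers the direct child: it forwards TERM and waits, nothing more.

```shell
#!/bin/sh
# Sketch only: a wrapper meant to run as PID 1 in front of one child command.
# run_as_init is a hypothetical name, not something from the Discourse image.

run_as_init() {
    # Start the real workload in the background so this shell stays free
    # to react to signals.
    "$@" &
    child=$!

    # When TERM or INT arrives, pass TERM on to the child instead of dying.
    trap "kill -TERM $child 2>/dev/null" TERM INT

    # wait returns early when a trapped signal arrives, so keep waiting
    # until the child is really gone.
    status=0
    while :; do
        wait "$child"
        status=$?
        kill -0 "$child" 2>/dev/null || break
    done

    trap - TERM INT
    return "$status"
}

# Hypothetical usage as a container entrypoint:
#   run_as_init some-long-running-command --its-args
```

This does not deal with the "recursive" case described above (grandchildren that get reparented to PID 1), and the returned status can be 128 plus the signal number rather than the child's own exit code if the signal lands just as the child exits. Which signal the child expects for a clean shutdown is also something to check per program rather than assume.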


I’m having this issue on Ubuntu 15.10, latest version of Discourse. :frowning:
My workaround was to do a “docker kill” after a reboot, and then rebuild.
Hope it helps someone else.

(Luke) #20

I’ll have to give this a go later today. Thanks. :wink: