For no apparent reason, my server (Hetzner VPS) became virtually unreachable today, i.e. it was so busy that it took multiple attempts to log in via SSH, and even when login didn’t fail, it took minutes until I got a first response. It turned out that docker was running wild, eating up the CPU.
I ended up restarting the server via the Hetzner dashboard after I lost the connection again. Luckily, I did not have to do a hard reset, but shutdown took ages (10 minutes?) until the server was actually down. When I restarted it, Discourse was still not reachable (5xx error), so I rebuilt the app. After that, I first thought it was still unreachable, but then the 5xx error went away and it’s working now. So I guess the container just needs some time to start up too, and it might have worked without the rebuild.
In any case: what could be the reason for docker using so much CPU?
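(In case anyone hits something similar: before rebooting, a couple of stock commands can usually show where the CPU is going. This is a generic sketch, not Discourse-specific; the process names are the usual Docker suspects.)

```shell
# Per-container CPU/memory usage, as a one-shot snapshot instead of a live stream:
docker stats --no-stream

# On the host, a single batch-mode top sorted by CPU; look for dockerd,
# containerd, or a containerd-shim process pinned near 100%:
top -b -n 1 -o %CPU | head -n 20
```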
And is there anything that can be done to prevent docker shutting down the entire server? I saw that docker can be limited in how much CPU it may use, but I’m not sure how to implement this with discourse and whether it is advisable at all, since it would artificially restrain the resources that discourse has access to…
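For reference, Docker itself can cap a container’s CPU at runtime. A minimal sketch, assuming the default Discourse container name app (check yours with docker ps) and a reasonably recent Docker release:

```shell
# Hard-cap the running container at 1.5 CPUs:
docker update --cpus "1.5" app

# Or lower its relative CPU weight (default is 1024) so other host processes
# win under contention, without capping it when the host is otherwise idle:
docker update --cpu-shares 512 app
```

Caveat: since the Discourse launcher recreates the container on a rebuild, a limit applied this way likely won’t survive a rebuild, and as noted above it does artificially restrain Discourse’s resources.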
The bug that I found would produce the opposite of CPU consumption; it causes the container shim to stop doing anything useful.
The docker info output shows 17.11.0-ce, which is the buggy-af version, so you’ll want to downgrade – but not to fix this bug; this one’s a new one, and very definitely not Discourse-specific. If the problem persists, you’ll want to raise it on the Docker forums (it should feel familiar…), or, if you can get a reliable reproduction of the problem, file a bug report.
How did I get that one? I don’t intentionally install experimental versions; all I do is run apt-get upgrade from time to time… Looks like one should be rather conservative in upgrading docker?
You probably got the “experimental” version because that’s what everyone gets if they initially installed via curl https://get.docker.com | sh. You have to change /etc/apt/sources.list.d/docker.list to track the stable repo rather than edge or whatever the default is.
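If it helps anyone else, switching channels and downgrading looks roughly like this. This is an Ubuntu example with assumptions baked in – the distro codename and the exact version string will differ on your system, so check apt-cache madison for what’s actually available:

```shell
# Point apt at Docker's stable channel instead of edge
# ("xenial" is an example codename; substitute your distro's):
echo "deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable" \
  | sudo tee /etc/apt/sources.list.d/docker.list

sudo apt-get update

# List the docker-ce versions the stable channel offers:
apt-cache madison docker-ce

# Then install one of the listed versions, e.g.:
# sudo apt-get install docker-ce=<version-string-from-madison>
```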
Being conservative in upgrading docker isn’t even enough; we follow that policy (still on 17.06), and we still got hit by a ripper of a bug that exists in all versions from 17.06 to 17.11 (and, by the glacial pace at which the PR is moving, will probably be in 18.01 or whatever the next release will be).
Basically, Docker’s gonna Docker, and all we can do is hang on for the ride and try and dodge the worst of the low-hanging tree branches.
Thanks Matt! Downgrading was super easy, app is currently rebuilding and I assume that was the end of that trouble.
Really fantastic work that probably saved me countless hours and unbearable headaches, because a bug in docker is about the last thing I would suspect, so I would just go crazy trying to understand what I’m doing wrong (and that’s particularly hard when you don’t even know what you’re doing right!).
I feel you – we spent the better part of a week blaming backups, sidekiq, and whatever else looked good before I managed to get a shell on a site that the owner was happy to leave down for hours whilst I rummaged around and got to the bottom of the problem.