For no apparent reason, my server (Hetzner VPS) became virtually unreachable today, i.e. it was so busy that it took multiple attempts to log in via ssh, and even when the login didn't fail, it literally took minutes until I got a first response. Turns out, docker was running wild:
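In case it happens again, here's roughly how I'd try to narrow down whether it's a container or the docker daemon itself eating the CPU (a sketch; `app` is the standard Discourse container name, adjust if yours differs):

```shell
# One-shot snapshot of per-container CPU/memory usage (no live stream)
docker stats --no-stream

# If a specific container is the culprit, list the processes running inside it
docker top app

# If dockerd itself is busy, check the daemon log for a crash/restart loop
journalctl -u docker --since "1 hour ago" | tail -n 50
```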
I ended up restarting the server via the Hetzner dashboard after I lost the connection again. Luckily, I did not have to do a hard reset, but shutdown took ages (10 minutes?) until the server was actually down. When I restarted it, discourse was still not reachable (5xx error), so I rebuilt the app. After that, I first thought it was still unreachable, but then the 5xx error went away and it's working now. So I guess the container just needs some time to start up, and it might have worked without the rebuild.
In any case: what could be the reason for docker using so much CPU?
In case it matters,
docker info gives me this:
Server Version: 17.11.0-ce
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Default Runtime: runc
Init Binary: docker-init
containerd version: 992280e8e265f491f7a624ab82f3e238be086e49
runc version: 0351df1c5a66838d0c392b4ac4cf9450de844e2d
init version: 949e6fa
Kernel Version: 4.4.0-104-generic
Operating System: Ubuntu 16.04.3 LTS
Total Memory: 1.953GiB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Live Restore Enabled: false
WARNING: No swap limit support
And is there anything that can be done to prevent docker from taking down the entire server? I saw that docker containers can be limited in how much CPU they may use, but I'm not sure how to implement this with discourse, or whether it is advisable at all, since it would artificially restrain the resources that discourse has access to…
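For what it's worth, a CPU cap can be applied to a running container without a rebuild; a sketch, assuming the Discourse container is named `app` and your docker version supports these flags on `docker update`:

```shell
# Hard cap: the container gets at most 1.5 CPUs' worth of time
docker update --cpus 1.5 app

# Softer alternative: a relative weight, only enforced when CPU is contended
docker update --cpu-shares 512 app
```

Whether that's a good idea for Discourse is a separate question – a hard cap applies all the time, not just when something runs away.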
What version of docker? @mpalmer found a nasty bug in the current version that will cause problems during backups.
Client:
 API version: 1.34
 Go version: go1.8.3
 Git commit: 1caf76c
 Built: Mon Nov 20 18:37:39 2017
Server:
 API version: 1.34 (minimum version 1.12)
 Go version: go1.8.3
 Git commit: 1caf76c
 Built: Mon Nov 20 18:36:09 2017
The bug that I found would produce the opposite of CPU consumption; it causes the container shim to stop doing anything useful.
docker info shows 17.11.0-ce, which is the buggy-af version, so you’ll want to downgrade, but not to fix this bug – this one’s a new one, and very definitely not Discourse-specific. If the problem persists, you’ll want to raise this on the Docker forums (it should feel familiar…), or if you can get a reliable reproduction of the problem, file a bug report.
How did I get that one? I don’t intentionally install experimental versions. All I do is
apt-get upgrade from time to time… Looks like one should be rather conservative in upgrading docker?
You probably got the “experimental” version because that’s what everyone gets if they initially installed via curl https://get.docker.com | sh. You have to change /etc/apt/sources.list.d/docker.list to track the stable repo rather than edge or whatever the default is.
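For reference, the switch is a one-liner; a sketch for Ubuntu 16.04 (xenial), assuming the installer wrote an entry ending in edge:

```shell
# /etc/apt/sources.list.d/docker.list should end in "stable", e.g.:
#   deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable
sudo sed -i 's/ edge$/ stable/' /etc/apt/sources.list.d/docker.list
sudo apt-get update
```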
Being conservative in upgrading docker isn’t even enough; we follow that policy (still on 17.06), and we still got hit by a ripper of a bug that exists in all versions from 17.06 to 17.11 (and, by the glacial pace at which the PR is moving, will probably be in 18.01 or whatever the next release will be).
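Even on the stable channel, you can freeze a known-good version so a routine apt-get upgrade can't pull in a surprise; a sketch using apt's hold mechanism:

```shell
# Prevent apt from upgrading docker-ce until you explicitly allow it
sudo apt-mark hold docker-ce

# Later, when you decide to upgrade deliberately:
sudo apt-mark unhold docker-ce
```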
Basically, Docker’s gonna Docker, and all we can do is hang on for the ride and try and dodge the worst of the low-hanging tree branches.
Thanks Matt! Downgrading was super easy, app is currently rebuilding and I assume that was the end of that trouble.
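For anyone finding this later: the downgrade is just an apt install of an explicit version; a sketch – list what's actually available first, the version string below is only an example of the format:

```shell
# Show the docker-ce versions apt knows about
apt-cache madison docker-ce

# Install a specific older build (example string; pick one from the list above)
sudo apt-get install docker-ce=17.06.2~ce-0~ubuntu
```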
Really fantastic work that probably saved me countless hours and unbearable headaches, because a bug in docker is about the last thing I would suspect, so I would just have gone crazy trying to understand what I was doing wrong (and that’s particularly hard when you don’t even know what you’re doing right!).
I feel you – we spent the better part of a week blaming backups, sidekiq, and whatever else looked good before I managed to get a shell on a site that the owner was happy to leave down for hours whilst I rummaged around and got to the bottom of the problem.