Docker install failing on bootstrap - failed to add veth to sandbox


(Lee_Ars) #1

Migrating my Discourse install to a new hosted server and running into some serious head-smash-on-desk problems that I can’t figure out. On executing the launcher bootstrap command, I get this:

./launcher bootstrap app
/usr/bin/docker: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:334: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: time=\\\\\\\"2016-11-10T14:20:29-05:00\\\\\\\" level=fatal msg=\\\\\\\"failed to add interface veth7d2a024 to sandbox: failed to get link by name \\\\\\\\\\\\\\\"veth7d2a024\\\\\\\\\\\\\\\": Link not found\\\\\\\" \\\\n\\\"\"\n".
Your Docker installation is not working correctly

See: https://meta.discourse.org/t/docker-error-on-bootstrap/13657/18?u=sam

The relevant errors here are the “failed to add interface vethxxx to sandbox” and “Link not found” bits, I believe.

This is on a server running Ubuntu 16.04 LTS, with the app.yml file and templates set up identically to the running instance from which I’m migrating (except for a change in the hostname).

edited to add - Using Docker 1.12.3, via the docker-engine package provided by Docker’s official repo.

I am using iptables (will attach my rules at the bottom), and some googling seems to reveal that Docker shits itself into a blind fury sometimes with iptables (representative error discussion, but there are many). So, I’ve already modified Docker with --iptables=false and bounced the server. Problem behavior is unaffected.
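For anyone following along, setting that flag looked roughly like this on a systemd box (a sketch only; the drop-in file name is just an example, and I’m assuming the stock unit’s ExecStart of dockerd -H fd://):

/etc/systemd/system/docker.service.d/iptables.conf:

# clear the packaged ExecStart, then re-declare it with --iptables=false added
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --iptables=false

sudo systemctl daemon-reload
sudo systemctl restart docker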

I’ve also followed this page’s advice and thrown in a pair of iptables rules to allow unrestricted traffic flow between eth0 and docker0. Problem behavior is unaffected.

I’ve tried flushing all iptables rules and bootstrapping again, both with and without --iptables=false set for Docker. Problem behavior is unaffected.
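(By “flushing” I mean dropping the firewall back to a wide-open state for the duration of the test, roughly:)

# reset default policies to ACCEPT first so the box stays reachable, then flush
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -F
iptables -X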

One weird thing which may or may not matter is that the veth interface listed in the error message does not match any interfaces shown when I do an ifconfig. Every time the bootstrap fails I’m left with another orphaned veth interface, but none of them match the ones listed in each error.
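(For reference, this is roughly how I’m comparing them; the leftover interfaces show up here, but the name in the error message never does:)

# list the veth interfaces currently on the host
ip link show type veth
# or, old-school
ifconfig -a | grep veth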

Any assistance would be great. I am totally lost as to where to go from here, especially if this turns out to be some kind of stupid docker bug.

Current iptables rules:

:INPUT DROP [110:10149]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [65:13032]
:LOG_AND_DROP - [0:0]
-A FORWARD -i docker0 -o eth0 -j ACCEPT
-A FORWARD -i eth0 -o docker0 -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -i docker0 -j ACCEPT
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -p tcp -m tcp --dport 22 -m state --state NEW -m recent --set --name DEFAULT --mask 255.255.255.255 --rsource
-A INPUT -p tcp -m tcp --dport 22 -m state --state NEW -m recent --update --seconds 60 --hitcount 4 --name DEFAULT --mask 255.255.255.255 --rsource -j LOG_AND_DROP
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 80 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 443 -j ACCEPT
-A LOG_AND_DROP -j LOG --log-prefix "iptables rate deny: " --log-level 7
-A LOG_AND_DROP -j DROP

(Jeff Atwood) #2

Quite possibly @mpalmer or @sam should advise.


(Lee_Ars) #3

Thanks, @codinghorror. Hopefully this’ll just be something simple.

edit-

Guh, this is almost certainly Docker-related. Running docker run hello-world craps out the same error:

# docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
c04b14da8d14: Pull complete 
Digest: sha256:0256e8a36e2070f7bf2d0b0763dbabdd67798512411de4cdcf9431a1feb60fd9
Status: Downloaded newer image for hello-world:latest
docker: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:334: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: time=\\\\\\\"2016-11-10T15:05:58-05:00\\\\\\\" level=fatal msg=\\\\\\\"failed to add interface veth8ba0d37 to sandbox: failed to get link by name \\\\\\\\\\\\\\\"veth8ba0d37\\\\\\\\\\\\\\\": Link not found\\\\\\\" \\\\n\\\"\"\n".

Gonna try to rip Docker out and start over, this time with the version in the Canonical repo instead of Docker’s. It’s just so powerfully frustrating when this kind of thing doesn’t work.
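(The plan is just the usual purge-and-reinstall dance, roughly:)

sudo apt-get purge docker-engine     # remove the package from Docker's repo
sudo apt-get install docker.io       # Canonical's (older) package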

edit^2 - No joy, identical problem behavior. (Tried with all 7 versions of docker-engine listed in the Docker repo, as well as the 1.10.3 version in the Canonical repo.)

Really seems to be tied to the fact that the veth mentioned in “failed to add interface vethnnnn to sandbox: failed to get link by name” doesn’t match the veth that’s actually being created and that I see with ifconfig.


(Matt Palmer) #4

Welcome to the shady underworld of Docker Mysteries.

As you say, this is definitely a Docker bug, but the fact that you’ve tried across multiple versions suggests it isn’t just a Docker bug – that is, there’s something else in the machine’s setup that’s causing problems. Browsing through the bugs that mention the error you’re getting, it looks like there’s a whole raft of possible causes, from appallingly bad endpoint protection software to bugs in certain kernels.

I see two ways forward. If you have l33t sk1llz in kernel hacking, you can watch the netlink messages flying around (I’m pretty sure tcpdump can capture them, from memory), and dig into the source of the running kernel to figure out why the messages aren’t doing what you might otherwise expect. The root problem might be in the kernel, or in Docker, but I’m pretty sure you’ll need to rummage around in the kernel to figure out what’s going on, anyway.

The other option is to do a very careful comparison between the two machines you’re running – every package, every running process, every /proc/sys setting – heck, potentially the checksum of every binary – to figure out what’s different between the working and non-working machine.
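On the first option, something like this might be enough to watch the interface events go by without full-on kernel spelunking (a sketch off the top of my head, run while reproducing the failure):

# watch link add/rename/delete events as they happen
ip monitor link

# in another terminal, watch what udev is doing to net devices
udevadm monitor --kernel --udev --subsystem-match=net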


(Lee_Ars) #5

Ugh. No, I am definitely not a kernel hacking wizard.

But…

…I have an idea. An insane idea. An idea so insane…it might just work.

Be back shortly.

edit - oh my god my plan might be working. fingers crossed.

edit^2 - holy shit, it worked.

I solved my problem by setting up a new LXD container/vm and setting docker & discourse up inside of that.
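Roughly the shape of it (the container name is just what I picked; the nesting flag is the important bit, since Docker needs it to run inside an LXD container):

lxc launch ubuntu:16.04 discourse              # fresh 16.04 container
lxc config set discourse security.nesting true # allow containers-in-containers
lxc restart discourse
lxc exec discourse -- bash                     # then install Docker + Discourse inside as usual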

I feel like some kind of insane mad scientist right now, seriously. There are, like…layers of things happening here.


(Lee_Ars) #6

literally me right now


(Jay Pfaffman) #7

That’s very strange.

I install it this way:

wget -qO- https://get.docker.com/ | sh

and it works just fine. These days ./discourse-setup will install it for you.


(Lee_Ars) #8

Yeah, I tried the docker install script—all it effectively ends up doing is sniffing out your distro version and adding the right repo to your sources list. It’s the same thing as the manual process, just faster.

The discourse-setup script isn’t an option for me because it immediately bails if it senses you’ve got something else bound to port 80 (which I do, since there are a half-dozen other sites on this server).

Still, things are a lot better now than they used to be—we’re miles ahead of having to screw with passenger and 80 different conflicting versions of Ruby :slight_smile:


(Lee_Ars) #9

Resurrecting this issue—I haven’t been able to make any progress on this, in spite of months of on-and-off effort. I’ve gone through multiple kernel upgrades and am currently on 4.8.0, and nothing—literally nothing I’ve done—has made even the slightest difference in the problem behavior. As far as I can tell, I’m the only person on the whole damn internet who’s having this exact problem. Which really sucks, because it’s on a dedicated server in a colo datacenter far far away from me, so I can’t just start swapping hardware. And “swap out hardware randomly and load new drivers and see if the old drivers were just magically fucking stuff up” is where I am on my own at this point.

So, I’ve opened an issue for the Docker folks to look at, and I’m hopeful this will lead to resolution. Would love to know what the root cause is, and if I’m able to get the issue resolved w/the assistance of the github crew, I’ll post the solution here. Because I’ve been this guy before when looking for answers, many times, and I wouldn’t wish that on anyone.


(Eli the Bearded) #10

Are you seeing anything in other logs with timestamps around that particular time? You say you have other things running on this particular system, and I’m wondering if something else is winning the race to find the veth interface and doing something to it that makes it unsuitable for Docker.
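Something along these lines, using the timestamp from the error message above, might turn something up (adjust the window to taste):

# kernel and udev messages in the minute around a failed bootstrap
journalctl -k --since "2016-11-10 14:19:00" --until "2016-11-10 14:21:00"
journalctl -u systemd-udevd --since "2016-11-10 14:19:00" --until "2016-11-10 14:21:00"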


(Lee_Ars) #11

No, I’m not seeing anything on any logs, and believe me, I’ve looked until my eyeballs fell out.

I do have lots of other things running on the system, but Discourse was one of the very first things I tried installing when I first set this box up, and the issues manifested themselves even when the server had nothing else on it except the base OS (Ubuntu 16.04 LTS) and Docker (installed via the regular Docker install script).

That’s the maddening thing. The system has been evolving pretty much continually as I’ve migrated more things onto it, and nothing has made a difference—even a giant kernel upgrade hasn’t affected the problem behavior. I’m at a total loss as to what could be causing it, unless it’s something about the particular hardware + driver mix, and since this is a hosted dedicated server in a datacenter across the country, there’s not much I can do about that.

The workaround of shoving Docker + Discourse into an LXC container and running it that way continues to be 100% functional, but I remain totally baffled and clueless about how to mitigate the problem. This ain’t my first troubleshooting rodeo, but damned if I know what to do next.


(Sam Saffron) #12

Ping me via PM, I don’t mind giving it a shot.


(Lee_Ars) #13

Can-do! I’ll hit you up on this and on the other thing as soon as I’m off this concall.


(Sam Saffron) #14

Making some progress here:

I do not think this is Docker’s fault; I think this could be this systemd bug causing a side effect:

In particular, seeing:

Jul  6 12:04:00 liquidity systemd-udevd[23622]: Could not generate persistent MAC address for veth1f42bcd: No such file or directory
Jul  6 12:04:00 liquidity systemd-udevd[23620]: Could not generate persistent MAC address for veth884f476: No such file or directory
Jul  6 12:04:00 liquidity systemd-udevd[23622]: error changing net interface name 'veth1f42bcd' to 'eth8': Device or resource busy

Continuing some debugging here.


(Lee_Ars) #15

This is why it’s always a good idea for someone smarter than me to look at this kind of thing!


(Sam Saffron) #16

OK, this is all due to systemd having some crazy feature called

Predictable Network Names

As soon as Docker created a new virtual interface on the network bridge, systemd-udevd would step in and rename the interface from vethxyz to ethN. That meant that by the time Docker tried to check back on the interface it had just created, it could no longer find it.

I disabled the feature by editing /etc/default/grub and setting:

GRUB_CMDLINE_LINUX="biosdevname=0 net.ifnames=0"

then running:

sudo update-grub

and rebooting so the new kernel command line takes effect. Previously, GRUB_CMDLINE_LINUX was set to just biosdevname=1.
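After the reboot you can sanity-check that the new command line stuck:

cat /proc/cmdline    # should now include biosdevname=0 net.ifnames=0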

To fully debug this, I enabled debug mode on the Docker daemon and kept running:

docker run --rm hello-world
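For reference, turning on daemon debug looked roughly like this (I’m sketching the daemon.json route; a -D on the dockerd command line amounts to the same thing):

# /etc/docker/daemon.json  (merge with anything already in there)
# { "debug": true }
sudo systemctl restart docker

# then tail the daemon logs while reproducing
journalctl -u docker -f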

(Lee_Ars) #17

It’s difficult to describe how happy I am to have this put to bed.

It’s also difficult to describe how much I hate systemd.

Thank you!!


(Sam Saffron) #18

You are not alone. @mpalmer has been holding off big time on upgrading us to 16.04; it’s going to happen one day, but systemd sure does scare him.


(Eli the Bearded) #19

This is the kind of “winning the race” scenario I was thinking of, though I hadn’t thought to include the kernel among the participants.

My understanding is that Predictable Network Names are normally enabled by default, and that the vethxyz name is the “predictable” name (as opposed to ethN, where N is just an unpredictable sequence number).

Having a non-default config on a new feature is probably not common, hence the lack of anyone else finding this.

I’m running 16.04 on my laptop, and it has certainly surprised me from time to time. In particular (and this is a laptop-specific issue), systemd crashing on recovery from sleep, followed by subtle and odd issues, like the shutdown or reboot commands failing to do anything.


(Matt Palmer) #20

Pfft, systemd doesn’t scare me, I just loathe it. We’re not on 16.04 yet because there are more important things to do at the moment than upgrade from a perfectly adequate OS. Once we do upgrade, we still won’t be using systemd, anyway; I know the magicks to neuter systemd and all its evils (and brain-dead security bugs that Poettering refuses to acknowledge).