Discourse did not restart after system crash -- aufs missing


(Daniel) #1

Hi there,

  • Linode let us know that there was an issue with our hardware and they were working on it.
  • When I logged into their dashboard we were powered off and the dashboard was locked.
  • They resolved the problem and the system booted up automatically.
  • Discourse didn’t start up.

I went into our Discourse directory (/var/docker … no, I haven’t changed it yet… :flushed:) and ran ./launcher to check the available commands for starting it back up.

It seems that it went and pulled the latest base image (samsaffron/discourse:1.0.7) from the Docker registry, and then ./launcher start app failed.

Running ./launcher rebuild app solved the problem but extended the duration of the outage.

I’d like to learn from this outage: could I have resolved this without running ./launcher rebuild app, or was it necessary?

Cheers


danny:/var/docker$ ./launcher
Unable to find image 'samsaffron/discourse:1.0.7' locally
Pulling repository samsaffron/discourse
22d62951587e: Pulling image (1.0.7) from samsaffron/discourse
22d62951587e: Pulling image (1.0.7) from samsaffron/discourse, endpoint: https://registry-1.docker.io/v1/
22d62951587e: Pulling dependent layers
511136ea3c5a: Download complete
7e2a471a454a: Pulling metadata
7e2a471a454a: Pulling fs layer
7e2a471a454a: Download complete
cdb5237bc8a7: Pulling metadata
cdb5237bc8a7: Pulling fs layer
cdb5237bc8a7: Download complete
fa6d84c1e733: Pulling metadata
fa6d84c1e733: Pulling fs layer
fa6d84c1e733: Download complete
91cf3969bafa: Pulling metadata
91cf3969bafa: Pulling fs layer
91cf3969bafa: Download complete
22d62951587e: Pulling metadata
22d62951587e: Pulling fs layer
22d62951587e: Download complete
22d62951587e: Download complete
Status: Downloaded newer image for samsaffron/discourse:1.0.7
Usage: launcher COMMAND CONFIG [--skip-prereqs]
Commands:
    start:      Start/initialize a container
    stop:       Stop a running container
    restart:    Restart a container
    destroy:    Stop and remove a container
    enter:      Use nsenter to enter a container
    ssh:        Start a bash shell in a running container
    logs:       Docker logs for container
    mailtest:   Test the mail settings in a container
    bootstrap:  Bootstrap a container for the config based on a template
    rebuild:    Rebuild a container (destroy old, bootstrap, start new)

Options:
    --skip-prereqs   Don't check prerequisites
    --docker-args    Extra arguments to pass when running docker
danny:/var/docker$ ./launcher start app
Invalid cid file, deleting, please re-run
danny:/var/docker$ ./launcher start app
No cid found, creating a new container
Calculated ENV: -e LANG=en_GB.UTF-8 -e HOME=/root -e RAILS_ENV=production -e UNICORN_WORKERS=2 -e UNICORN_SIDEKIQS=1 -e RUBY_GC_MALLOC_LIMIT=40000000 -e RUBY_HEAP_MIN_SLOTS=800000 -e DISCOURSE_DB_SOCKET=/var/run/postgresql -e DISCOURSE_DB_HOST= -e DISCOURSE_DB_PORT= -e DISCOURSE_DEVELOPER_EMAILS=xxx -e DISCOURSE_HOSTNAME=xxx -e DISCOURSE_SMTP_ADDRESS=xxx -e DISCOURSE_SMTP_PORT=587 -e DISCOURSE_SMTP_USER_NAME=xxx -e DISCOURSE_SMTP_PASSWORD=xxx
a4589c35cbb79f71f1bbd39e10f79788424f37892b350409cb5eec3f819f63db
FATA[0000] Error response from daemon: Cannot start container a4589c35cbb79f71f1bbd39e10f79788424f37892b350409cb5eec3f819f63db: Cannot find child for /app
danny:/var/docker$

danny:/var/docker$ docker --version
Docker version 1.4.1, build 5bc2ff8
danny:/var/docker$ docker info
Containers: 1
Images: 19
Storage Driver: devicemapper
 Pool Name: docker-202:0-41418-pool
 Pool Blocksize: 65.54 kB
 Data file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata file: /var/lib/docker/devicemapper/devicemapper/metadata
 Data Space Used: 3.753 GB
 Data Space Total: 107.4 GB
 Metadata Space Used: 3.277 MB
 Metadata Space Total: 2.147 GB
 Library Version: 1.02.82-git (2013-10-04)
Execution Driver: native-0.2
Kernel Version: 3.13.0-44-generic
Operating System: Ubuntu 14.04.1 LTS
CPUs: 1
Total Memory: 990 MiB
Name: xxx
ID: RII5:S6VC:V5GE:23Z4:YXPL:6KJ6:X454:CFBU:WFS5:UL42:YU5P:47T2
danny:/var/docker$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.1 LTS
Release:        14.04
Codename:       trusty

(Jeff Atwood) #2

Hard to say, were you on an old version of Docker? There were startup issues in several older versions of Docker.

I reboot Digital Ocean droplets regularly and have never had a problem with Discourse coming up.

I suspect there was some hardware issue, since the Linode was down hard?


(Sam Saffron) #3

devicemapper is a nightmare, only had issues with it, recommend you get aufs going.

Will double check there have been no regressions here in 1.4.1 though.
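
A quick way to keep an eye on this is to pull the driver name out of docker info after every reboot or kernel update. This is only a rough sketch: storage_driver is a made-up helper name, not something Docker or the launcher provides, and it assumes the "Storage Driver:" line looks like the one in the output above.

```shell
#!/bin/sh
# Sketch only: storage_driver is a hypothetical helper, not part of Docker.
# It reads `docker info` text on stdin and prints the storage driver name.
storage_driver() {
    # Strip the "Storage Driver:" label (with optional indentation) and
    # print whatever follows it; only the first match is used.
    sed -n 's/^[[:space:]]*Storage Driver:[[:space:]]*//p' | head -n 1
}

# Example usage on a host with Docker installed:
#   driver=$(docker info 2>/dev/null | storage_driver)
#   [ "$driver" = "aufs" ] || echo "storage driver is '$driver', not aufs" >&2
```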


(Daniel) #4

I’ve been told by Linode that the server was shut down cleanly to perform maintenance on the physical host.

@sam, good spot with devicemapper. Pretty sure that said aufs when I last checked a few days ago after upgrading Docker (and certainly did when I first set it up). Will try to figure out why it has switched over to devicemapper tomorrow.


(Daniel) #5

At some point I must have updated the kernel when installing packages. The related linux-image-extra-* package wasn’t automatically installed with it so on reboot I no longer had the aufs module.

This brought aufs back

$ sudo apt-get install linux-image-extra-`uname -r`
$ sudo modprobe aufs

This package depends on the latest linux-image-extra-* package so it should be installed automatically with later kernels

$ sudo apt-get install linux-image-generic
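
To confirm the module is really there before restarting Docker, something like the following should work. It is a sketch under a couple of assumptions: has_fs is a hypothetical helper name, and a filesystem only shows up in /proc/filesystems once its module is loaded (or built in), so run modprobe first. The optional second argument is only there so the function can be tried against a saved copy of the file.

```shell
#!/bin/sh
# Sketch only: has_fs is a hypothetical helper.
# has_fs NAME [FILE] succeeds if filesystem NAME is listed in FILE
# (default: /proc/filesystems).
has_fs() {
    # Each line is either "nodev<TAB>name" or "<TAB>name", so the
    # filesystem name is always the last field.
    awk -v fs="$1" '$NF == fs { found = 1 } END { exit !found }' \
        "${2:-/proc/filesystems}"
}

# Example usage:
#   has_fs aufs || echo "aufs missing; is linux-image-extra-$(uname -r) installed?" >&2
```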

Stopped discourse

$ ./launcher stop app

Restarted docker

$ sudo service docker restart
$ docker info
Containers: 1
Images: 39
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Dirs: 41
Execution Driver: native-0.2
Kernel Version: 3.13.0-44-generic
Operating System: Ubuntu 14.04.1 LTS
CPUs: 1
Total Memory: 990 MiB
Name: hostname
ID: RII5:S6VC:V5GE:23Z4:YXPL:6KJ6:X454:CFBU:WFS5:UL42:YU5P:47T2

Rebuilding failed

$ ./launcher rebuild app
Updating discourse docker
Already up-to-date.
Stopping old container
Error response from daemon: No such container: 07327361d512ca847af5aca25899fd5fcba3c08c858b299111333e7e20d2a609
FATA[0000] Error: failed to stop one or more containers
Calculated ENV: -e LANG=en_GB.UTF-8 -e HOME=/root -e RAILS_ENV=production -e UNICORN_WORKERS=2 -e UNICORN_SIDEKIQS=1 -e RUBY_GC_MALLOC_LIMIT=40000000 -e RUBY_HEAP_MIN_SLOTS=800000 -e DISCOURSE_DB_SOCKET=/var/run/postgresql -e DISCOURSE_DB_HOST= -e DISCOURSE_DB_PORT= -e DISCOURSE_DEVELOPER_EMAILS=xxx -e DISCOURSE_HOSTNAME=xxx -e DISCOURSE_SMTP_ADDRESS=xxx -e DISCOURSE_SMTP_PORT=587 -e DISCOURSE_SMTP_USER_NAME=xxx -e DISCOURSE_SMTP_PASSWORD=xxx
cd /pups && git pull && /pups/bin/pups --stdin
Already up-to-date.
I, [2015-01-28T08:34:07.059442 #43]  INFO -- : Loading --stdin
I, [2015-01-28T08:34:07.068127 #43]  INFO -- : > echo cron is now included in base image, remove from templates
I, [2015-01-28T08:34:07.084485 #43]  INFO -- : cron is now included in base image, remove from templates

I, [2015-01-28T08:34:07.085856 #43]  INFO -- : > echo rsyslog template is included in base image, remove
I, [2015-01-28T08:34:07.099954 #43]  INFO -- : rsyslog template is included in base image, remove

I, [2015-01-28T08:34:07.101276 #43]  INFO -- : > mkdir -p /shared/postgres_run
I, [2015-01-28T08:34:07.117345 #43]  INFO -- :
I, [2015-01-28T08:34:07.121109 #43]  INFO -- : > chown postgres:postgres /shared/postgres_run
I, [2015-01-28T08:34:07.135823 #43]  INFO -- :
I, [2015-01-28T08:34:07.137190 #43]  INFO -- : > chmod 775 /shared/postgres_run
I, [2015-01-28T08:34:07.150655 #43]  INFO -- :
I, [2015-01-28T08:34:07.151976 #43]  INFO -- : > rm -fr /var/run/postgresql
I, [2015-01-28T08:34:07.167776 #43]  INFO -- :
I, [2015-01-28T08:34:07.169626 #43]  INFO -- : > ln -s /shared/postgres_run /var/run/postgresql
I, [2015-01-28T08:34:07.188824 #43]  INFO -- :
I, [2015-01-28T08:34:07.190255 #43]  INFO -- : > socat /dev/null UNIX-CONNECT:/shared/postgres_run/.s.PGSQL.5432 || exit 0 && echo postgres already running stop container ; exit 1
I, [2015-01-28T08:34:07.215357 #43]  INFO -- : postgres already running stop container



FAILED
--------------------
RuntimeError: socat /dev/null UNIX-CONNECT:/shared/postgres_run/.s.PGSQL.5432 || exit 0 && echo postgres already running stop container ; exit 1 failed with return #<Process::Status: pid 52 exit 1>
Location of failure: /pups/lib/pups/exec_command.rb:105:in `spawn'
exec failed with the params "socat /dev/null UNIX-CONNECT:/shared/postgres_run/.s.PGSQL.5432 || exit 0 && echo postgres already running stop container ; exit 1"
38f442b2394c514d4335a3682327f509c55d60d9ca4e00e2006da51797bfd181
FAILED TO BOOTSTRAP

Stopped an existing container (I guess the previous container?)

$ docker ps
CONTAINER ID        IMAGE                        COMMAND             CREATED             STATUS              PORTS                                        NAMES
0d03ee9bd474        local_discourse/app:latest   "/sbin/boot"        5 days ago          Up 47 seconds       0.0.0.0:2222->22/tcp, 0.0.0.0:8010->80/tcp   mad_lovelace
$ docker stop 0d03ee9bd474c993a25b0303e6134f122f96a725197b35596ff3233bc1311302
0d03ee9bd474c993a25b0303e6134f122f96a725197b35596ff3233bc1311302

Rebuilt the container

$ ./launcher rebuild app
Updating discourse docker
Already up-to-date.
Stopping old container
Error response from daemon: No such container: 07327361d512ca847af5aca25899fd5fcba3c08c858b299111333e7e20d2a609
FATA[0000] Error: failed to stop one or more containers

Rebuilding continued as normal after this until

[197] 28 Jan 08:43:40.710 # User requested shutdown...
[197] 28 Jan 08:43:40.710 * Saving the final RDB snapshot before exiting.
ca96a2a870da6ca9290a0d4f62c7e8de9acdceb1cb19959c846c46936a7bed9f
92ee02472c395f8774ef40ab3e26f7dd99062b57059fc2c1c6c5cb0abf0ea21c
Error response from daemon: No such container: 07327361d512ca847af5aca25899fd5fcba3c08c858b299111333e7e20d2a609
FATA[0000] Error: failed to remove one or more containers
Invalid cid file, deleting, please re-run

Tried to start

$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
$ ./launcher start app
No cid found, creating a new container
Calculated ENV: -e LANG=en_GB.UTF-8 -e HOME=/root -e RAILS_ENV=production -e UNICORN_WORKERS=2 -e UNICORN_SIDEKIQS=1 -e RUBY_GC_MALLOC_LIMIT=40000000 -e RUBY_HEAP_MIN_SLOTS=800000 -e DISCOURSE_DB_SOCKET=/var/run/postgresql -e DISCOURSE_DB_HOST= -e DISCOURSE_DB_PORT= -e DISCOURSE_DEVELOPER_EMAILS=xxx -e DISCOURSE_HOSTNAME=xxx -e DISCOURSE_SMTP_ADDRESS=xxx -e DISCOURSE_SMTP_PORT=587 -e DISCOURSE_SMTP_USER_NAME=xxx -e DISCOURSE_SMTP_PASSWORD=xxx
9302bfdc48b8ea4f0f2a697e0ed7d3e3d39d9c7bbc022efd4f613d0969b2eec4
FATA[0000] Error response from daemon: Cannot start container 9302bfdc48b8ea4f0f2a697e0ed7d3e3d39d9c7bbc022efd4f613d0969b2eec4: Cannot find child for /app

:fearful: :fearful:

$ docker ps -a
CONTAINER ID        IMAGE                        COMMAND             CREATED             STATUS                       PORTS               NAMES
9302bfdc48b8        local_discourse/app:latest   "/sbin/boot"        28 seconds ago
0d03ee9bd474        afbcbd997e14                 "/sbin/boot"        5 days ago          Exited (143) 9 minutes ago                       mad_lovelace
$ ls cids
app.cid
$ cat cids/app.cid
9302bfdc48b8ea4f0f2a697e0ed7d3e3d39d9c7bbc022efd4f613d0969b2eec4
$ docker stop `docker ps -aq`
9302bfdc48b8
0d03ee9bd474
$ docker rm `docker ps -aq`
9302bfdc48b8
0d03ee9bd474
$ docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
$ ./launcher start app
Invalid cid file, deleting, please re-run
$ ./launcher start app
No cid found, creating a new container
Calculated ENV: -e LANG=en_GB.UTF-8 -e HOME=/root -e RAILS_ENV=production -e UNICORN_WORKERS=2 -e UNICORN_SIDEKIQS=1 -e RUBY_GC_MALLOC_LIMIT=40000000 -e RUBY_HEAP_MIN_SLOTS=800000 -e DISCOURSE_DB_SOCKET=/var/run/postgresql -e DISCOURSE_DB_HOST= -e DISCOURSE_DB_PORT= -e DISCOURSE_DEVELOPER_EMAILS=xxx -e DISCOURSE_HOSTNAME=xxx -e DISCOURSE_SMTP_ADDRESS=xxx -e DISCOURSE_SMTP_PORT=587 -e DISCOURSE_SMTP_USER_NAME=xxx -e DISCOURSE_SMTP_PASSWORD=xxx
e1972f337b92923b903c987fcec8bfe0083f5d5f4a42234250288cf9b668a518
$ docker info
Containers: 1
Images: 39
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Dirs: 41
Execution Driver: native-0.2
Kernel Version: 3.13.0-44-generic
Operating System: Ubuntu 14.04.1 LTS
CPUs: 1
Total Memory: 990 MiB
Name: hostname
ID: RII5:S6VC:V5GE:23Z4:YXPL:6KJ6:X454:CFBU:WFS5:UL42:YU5P:47T2
$

:sweat_smile:

It seems “docker rm `docker ps -aq`” was the necessary thing here at some point. I’m not familiar with Docker so quite pleased with the outcome.
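
My guess is that cids/app.cid was pointing at a container that Docker no longer knew about after the storage driver changed. A rough way to spot that state, assuming the launcher keeps the full container ID in cids/app.cid as seen above (cid_is_known is a name I made up for this sketch):

```shell
#!/bin/sh
# Sketch only: cid_is_known is a hypothetical helper, not part of launcher.
# cid_is_known CIDFILE reads known container IDs (one per line) on stdin
# and succeeds if the ID stored in CIDFILE is among them.
# Returns 2 if CIDFILE is missing or empty.
cid_is_known() {
    cid=$(cat "$1" 2>/dev/null) || return 2
    [ -n "$cid" ] || return 2
    # -x: whole-line match, -F: fixed string, -q: quiet
    grep -qxF -- "$cid"
}

# Example usage from the launcher directory:
#   docker ps -aq --no-trunc | cid_is_known cids/app.cid \
#       || echo "cids/app.cid points at a container Docker does not know" >&2
```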


(Michael Downey) #6

After switching back to aufs while recovering from the same problem shown above, I now have a huge devicemapper data file that was last touched on the day I switched back.

Should this be safe to delete now?

524560 4178576 -rw------- 1 root root 107374182400 Feb 9 15:34 /var/lib/docker/devicemapper/devicemapper/data


(Sam Saffron) #7

yeah … you can nuke that.
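
For what it’s worth, that loopback file is sparse, so it takes up far less disk than its 100 GiB apparent size suggests. If the second column in that listing is the allocated block count from ls -s (an assumption on my part), it is roughly 4 GB on disk. A small sketch for comparing the two numbers; the helper names are made up and it relies on GNU stat as shipped with Ubuntu:

```shell
#!/bin/sh
# Sketch only: apparent_bytes and allocated_bytes are hypothetical helpers.
# Both rely on GNU stat (coreutils).

# Size the file claims to have, in bytes.
apparent_bytes() {
    stat -c '%s' "$1"
}

# Space actually allocated on disk, in bytes.
allocated_bytes() {
    # %b = number of blocks allocated, %B = size in bytes of each block
    set -- $(stat -c '%b %B' "$1")
    echo $(( $1 * $2 ))
}

# Example usage:
#   f=/var/lib/docker/devicemapper/devicemapper/data
#   echo "apparent: $(apparent_bytes "$f")  allocated: $(allocated_bytes "$f")"
```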


(Marco) #8

I have the same problem and I’m stuck.

I simply did:

sudo apt-get update
sudo apt-get dist-upgrade
sudo reboot

After the reboot, docker does not start:

sudo service docker status

The error is something about aufs.

time="2017-01-17T07:47:09.955350610+01:00" level=error msg="[graphdriver] prior storage driver "aufs" failed: driver not supported"

I think the autoclean command deleted some important linux images? But I really don’t know what I’m talking about.

uname -r
3.13.0-32-generic

dpkg --get-selections | grep linux-image
linux-image-4.4.0-59-generic                    install
linux-image-extra-4.4.0-59-generic              install

Strange, no?

Please help!


(Sam Saffron) #9

For 4.4 you are going to need to use overlay2, I think btrfs is deprecated beyond 4.2 or something.


(Jay Pfaffman) #10

What OS and hosting service is this?

If it’s something like Digital Ocean, I’d recommend just cranking up a fresh droplet.


(Marco) #11

dockerd --storage-driver=overlay2
INFO[0000] libcontainerd: new containerd process, pid: 1032
WARN[0000] containerd: low RLIMIT_NOFILE changing to max  current=1024 max=65536
ERRO[0001] 'overlay' not found as a supported filesystem on this host. Please ensure kernel is new enough and has overlay support loaded.
FATA[0001] Error starting daemon: error initializing graphdriver: driver not supported

How do I ensure that overlay support is loaded?

Yes, it is DO. Will I lose my whole Discourse installation?


(Sam Saffron) #12

Maybe read this carefully: Use the OverlayFS storage driver | Docker Documentation

try to see if you have overlay or overlay2, either should work (and should work much better than graphdriver)
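
As far as I recall from the Docker docs, the overlay2 driver wants a 4.0 or newer kernel (distro kernels with backports aside), so it is worth checking what uname -r says before anything else. A minimal sketch of that comparison; kernel_at_least is a hypothetical helper and it only looks at the major and minor numbers:

```shell
#!/bin/sh
# Sketch only: kernel_at_least is a hypothetical helper.
# kernel_at_least RELEASE MAJOR MINOR succeeds if RELEASE (as printed by
# `uname -r`, e.g. 3.13.0-32-generic) is at least MAJOR.MINOR.
kernel_at_least() {
    rel_major=${1%%.*}            # text before the first dot
    rest=${1#*.}                  # text after the first dot
    rel_minor=${rest%%[!0-9]*}    # leading digits of the remainder
    [ "$rel_major" -gt "$2" ] ||
        { [ "$rel_major" -eq "$2" ] && [ "$rel_minor" -ge "$3" ]; }
}

# Example usage:
#   kernel_at_least "$(uname -r)" 4 0 || echo "running kernel too old for overlay2" >&2
```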


(Marco) #13

I don’t have any overlay!

I find it strange that the Linux kernel versions don’t match:

uname -r
3.13.0-32-generic

dpkg --get-selections | grep linux-image
linux-image-4.4.0-59-generic                    install
linux-image-extra-4.4.0-59-generic              install

Is this normal?


(Sam Saffron) #14

Well, looks like your install is all mucked up.

On Digital Ocean upgrading a kernel is a mission and a half: you can’t just install the packages. The hypervisor chooses the kernel, and there is a magic UI for it.

What you want to do is make sure you have the matching kernel packages installed to the kernel you are actually running.
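
Something along these lines would flag the mismatch. It is only a sketch: extra_installed_for is a hypothetical helper name, and it assumes the Ubuntu package naming seen earlier in this topic (linux-image-extra-<release>).

```shell
#!/bin/sh
# Sketch only: extra_installed_for is a hypothetical helper.
# extra_installed_for RELEASE reads `dpkg --get-selections` output on stdin
# and succeeds if linux-image-extra-RELEASE is marked "install".
extra_installed_for() {
    awk -v pkg="linux-image-extra-$1" \
        '$1 == pkg && $2 == "install" { found = 1 } END { exit !found }'
}

# Example usage:
#   dpkg --get-selections | extra_installed_for "$(uname -r)" \
#       || echo "no linux-image-extra package for the running kernel $(uname -r)" >&2
```

With the output posted above (running 3.13.0-32-generic, packages only for 4.4.0-59-generic) this check fails, which matches aufs going missing.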


(Jay Pfaffman) #15

If you’ve got a recent backup, then you can just

  1. create new droplet
  2. restore backup to new droplet
  3. change DNS to point to new droplet
  4. make sure everything is cool
  5. delete old droplet

(Marco) #16

Problem is that my Discourse instance lived in a Docker container and I can’t access Docker anymore. I can’t do:

sudo ./launcher enter app

in the old droplet. Is there a way to copy the whole Discourse as a package in a new droplet and try it there? Like if it was a virtual machine, I mean.


(Jeff Atwood) #17

Database is stored outside Docker.


(Marco) #18

Ok, if I try:

pg_dump -xOf /shared/discourse-backup.sql -d discourse -n public

I get:
pg_dump: [archiver (db)] connection to database "discourse" failed: could not connect to server: No such file or directory


(Jay Pfaffman) #19

You might copy /var/discourse to a new droplet.


(Marco) #20

IT WORKS!
Thank you guys, this has been hard :slight_smile: :slight_smile: I’m so happy I could recover my installation.