Our server was showing some users 500 errors, so I went to update Discourse to the latest version (I hadn’t updated in a month or so).
The update failed: it hung for 20 or 30 minutes while installing a Redis gem, so I tried SSH’ing in and pulling master from discourse-docker.
After doing that, I ran the launcher. It ran a /usr/bin/docker stop -t 10 app command that I think was making the server wait for 10 minutes before attempting to relaunch. I didn’t want our forum to be down for a full 10 minutes, so I just power cycled our DigitalOcean droplet, which in the past has brought the forum back up fine. Unfortunately, it didn’t work this time. Here is part of the log:
We have almost 10 gigs free. Is there some sort of swap file that would take up more than that?
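Before deleting anything, it’s worth confirming both numbers with plain Linux tools (nothing Discourse-specific here):

```shell
# Quick, non-destructive checks: how much disk is actually free on the
# root filesystem, and whether any swap is configured.
df -h /                                      # free space on /
grep -E 'SwapTotal|SwapFree' /proc/meminfo   # swap size, if any
```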
I could go in and manually delete one or more of the backups (we keep 2 or 3 days backups, but we probably don’t need to since we also send them to s3). Do you know where these backups would be located?
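On a standard discourse-docker install the local backups usually live under the shared volume; the path below assumes the stock “standalone” data directory name, so adjust it if your setup differs. A quick sketch for sizing them up before deleting anything:

```shell
# Assumed default backup location for a stock discourse-docker install;
# the "standalone" directory name may differ on your setup.
BACKUP_DIR=/var/discourse/shared/standalone/backups/default

if [ -d "$BACKUP_DIR" ]; then
  # List backups largest-first, so it's obvious what deleting would reclaim.
  du -sh "$BACKUP_DIR"/* 2>/dev/null | sort -rh
else
  echo "no backup directory at $BACKUP_DIR"
fi
```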
Thanks. The backups are only a few hundred megabytes.
I am trying to run ./launcher rebuild app again, and it hangs on this:
/var/discourse# ./launcher rebuild app
Ensuring launcher is up to date
Fetching origin
Launcher is up-to-date
Stopping old container
+ /usr/bin/docker stop -t 10 app
Less than 10 seconds. If you have a long-running transaction on the DB this can be problematic (we had this a few commits ago); you can log in, try to find the problematic PID, and kill it.
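A sketch of hunting down that long-running transaction, assuming the stock discourse-docker layout (container named `app`, database named `discourse`); adjust the names to your setup. From the host, `cd /var/discourse && ./launcher enter app`, then inside the container feed this query to Postgres (e.g. via `su postgres -c 'psql discourse'`):

```shell
# Print the query to run against pg_stat_activity: it lists open
# transactions, oldest first, so the stuck PID floats to the top.
cat <<'SQL'
SELECT pid, now() - xact_start AS duration, state, query
  FROM pg_stat_activity
 WHERE xact_start IS NOT NULL
 ORDER BY duration DESC;
-- once you have the offending pid:
-- SELECT pg_terminate_backend(pid);
SQL
```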
Well, there’s ./launcher start app, which will just fire up the already-built container if it isn’t already running, but your container appears somewhat hosed, based on the consumption of a full core by each of two runsv processes. I’d jump straight into strace -p 1428 -f -s 10000, but I’m weird like that.
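For anyone following along, you don’t have to guess which PIDs to strace: sort processes by CPU first and attach to whatever is at the top (the 1428 above was just the PID on that particular box).

```shell
# Show the top CPU consumers: PID, parent, %CPU, state, and command name.
ps -eo pid,ppid,pcpu,stat,comm --sort=-pcpu | head -n 10
# Then attach to a suspect PID (needs root):
#   strace -f -s 10000 -p <pid>
```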
@sam Awesome! Thank you for helping with this, and for keeping me posted. We’ve started the process of creating a new instance from a restore, just in case the old instance is unrecoverable. (Though it may take hours to go through this; we just want to minimize downtime.)
Attempted to kill the container, had no luck … docker stop app was not working; the machine was very hosed and eating up 100% CPU.
Attempted to kill the runaway perl script you had running, no idea where it came from
Tried rebooting from the command line with no luck, so you did a hard reboot.
Did a successful ./launcher rebuild app
The culprit of all this mess was a perl script someone had running on your box that was eating 100% CPU. No idea what it was; it totally hosed things and refused to be killed, even with a kill -9.
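For the curious: when kill -9 appears to fail, it’s usually one of two things, and both are easy to check. Either the process is in uninterruptible sleep (“D” state, blocked inside the kernel, typically on I/O), so the signal can’t be delivered until it wakes, or a supervisor keeps respawning a fresh copy the moment the old one dies.

```shell
# Show the header plus any processes stuck in uninterruptible sleep.
ps -eo pid,stat,comm | awk 'NR == 1 || $2 ~ /^D/'
# If a runit supervisor (runsv) is the parent, stop the service rather
# than the child, e.g.:
#   sv stop <service-name>
```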
@sam Thank you for the recap. I’ve linked to this thread for any campers who are curious about the outage.
If this ever happens again, I’ll remember to use ./launcher cleanup
I was scared to update docker in case there might be some breaking changes and you were stuck on an old version (I know very little about Docker).
Yes - it was crazy that I couldn’t kill that process. I tried everything and rebooted several times, and it kept coming back. Was it the runsv process? At one point there were two of them running, eating up 90% CPU.