Emergency help - freeCodeCamp Discourse down

Our server was showing some users 500 errors, so I went to update Discourse to the latest version (I hadn’t updated in a month or so).

The update failed. It stopped for about 20 or 30 minutes when installing a Redis gem, so I tried SSH’ing in and pulling master from discourse-docker.

After doing that, I ran the launcher. It had this /usr/bin/docker stop -t 10 app command that I think was making the server wait for 10 minutes before attempting to relaunch. I didn’t want our forum to be down for a full 10 minutes, so I just power cycled our DigitalOcean droplet, which in the past has brought the forum back up fine. Unfortunately, it didn’t work this time. Here is part of the log:

https://gist.github.com/QuincyLarson/995f51e047e7e4188531f8a658faf99e

Any ideas what could be causing this?

Nothing we can see in those logs.

Please check disk space, and if possible paste a log from a rebuild.
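As a rough illustration of the disk space check requested above, the sketch below reads the use percentage from POSIX-format `df -P` output. The function names and the 90 percent threshold are hypothetical and chosen only for this example; they are not part of Discourse or its launcher.

```shell
# Read `df -P` output on stdin and print the use percentage
# (digits only) from the first filesystem row.
use_percent_from_df() {
  awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Print a one-line verdict for a use percentage.
# The default limit of 90 is an arbitrary illustration, not a Discourse setting.
disk_status() {
  used="$1"
  limit="${2:-90}"
  if [ "$used" -ge "$limit" ]; then
    echo "WARN: ${used}% used"
  else
    echo "OK: ${used}% used"
  fi
}
```

On the server this could be run as `disk_status "$(df -P /var/discourse | use_percent_from_df)"`, assuming the install lives in /var/discourse as it does in this thread.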

We have almost 10 gigs free. Is there some sort of swap file that would take up more than that?

I could go in and manually delete one or more of the backups (we keep 2 or 3 days of backups, but we probably don’t need to since we also send them to S3). Do you know where these backups would be located?

By default at /var/discourse/shared/standalone/backups/default/

Thanks. The backups are only a few hundred megabytes.

I am trying to run ./launcher rebuild app again and it hangs on this:

/var/discourse# ./launcher rebuild app
Ensuring launcher is up to date
Fetching origin
Launcher is up-to-date
Stopping old container
+ /usr/bin/docker stop -t 10 app

Is this normal? How long does this usually take?

Less than 10 seconds. If you have a long-running transaction on the DB this can be problematic (we had this a few commits ago). You can log in and try to find the problematic PID and kill it.
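One possible way to look for such a transaction is to query PostgreSQL's standard pg_stat_activity view, sketched below. The helper name long_tx_sql is hypothetical, and the sketch assumes the container still responds well enough to run psql inside it, which may not have been the case here.

```shell
# Print a query that lists open transactions, oldest first.
# pg_stat_activity and its pid, state, xact_start and query columns
# are standard PostgreSQL; the 80-character cut is only for readability.
long_tx_sql() {
  cat <<'SQL'
SELECT pid, state, now() - xact_start AS tx_age, left(query, 80) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
SQL
}
```

Assuming a standard install where the database is named discourse, this could be run inside the container (after ./launcher enter app) with `long_tx_sql | su postgres -c 'psql discourse'`. A backend found this way can then be ended with `SELECT pg_terminate_backend(<pid>);`.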

I ran top and saw this:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1428 root      20   0    4236    364    292 R  98.1  0.0  13:36.66 runsv
 1429 root      20   0    4236    368    292 R  94.8  0.0  13:32.93 runsv
    7 root      20   0       0      0      0 S   0.3  0.0   0:02.48 rcu_sched
 7729 root      20   0   21960   1540   1100 R   0.3  0.0   0:00.08 top
    1 root      20   0   33496   2804   1444 S   0.0  0.1   0:01.98 init

Is there another way to start discourse other than the ./launcher rebuild app command?

Well, there’s ./launcher start app, which will just fire up the already-built container if it isn’t already running, but your container appears somewhat hosed, based on the consumption of a full core by each of two runsv processes. I’d jump straight into strace -p 1428 -f -s 10000, but I’m weird like that.
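The runaway processes visible in the top output can also be picked out non-interactively before attaching strace. The filter below is a minimal sketch: hot_processes is a hypothetical helper name, and it expects rows shaped like the output of `ps -eo pid,pcpu,comm` (PID, %CPU, command, with a header line).

```shell
# Read "PID %CPU COMMAND" rows (with a header line) on stdin and print
# the PID and command name of every process above the CPU threshold.
# The default threshold of 90 is an arbitrary illustration.
hot_processes() {
  threshold="${1:-90}"
  awk -v t="$threshold" 'NR > 1 && $2 + 0 > t { print $1, $3 }'
}
```

For example, `ps -eo pid,pcpu,comm | hot_processes 90` would list candidates, and each PID could then be inspected with the strace -p <pid> -f -s 10000 invocation mentioned above. Note that ps reports CPU use averaged over the life of the process, so its numbers can differ from the instantaneous figures in top.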

I restarted the droplet and now runsv isn’t running anymore, but it’s still not up and we’re still having the original problem with our logs.

I am having a look now…

perl is chewing 100% cpu on this machine …

Upgraded Docker and am trying to reboot the box; then I will have a look at the CPU to confirm that whatever perl script you were running is no longer running.


@sam Awesome! Thank you for helping with this, and for keeping me posted. We’ve started the process of creating a new instance from a restore just in case the old instance is unrecoverable. (Though it may take hours to go through this - just want to minimize downtime.)

I need you to reboot that server from the DigitalOcean panel; that perl script caused super havoc.

reboot via command line did not work

I’ve power cycled it.

cool … old image is nuked, did a cleanup … it is bootstrapping just fine


OK - so it isn’t hanging at the /usr/bin/docker stop -t 10 app like it was before? Does this mean we won’t need to restore everything?

I think you should be fine, it is precompiling assets now, should be done in a minute or 2

https://forum.freecodecamp.com/ is up.

A recap on what I did.

  1. upgrade docker
  2. ./launcher cleanup to free up more space
  3. Attempted to kill container, had no luck … docker stop app was not working, machine was very hosed and eating up 100% cpu.
  4. Attempted to kill the runaway perl script you had running, no idea where it came from
  5. Tried rebooting from the command line and had no luck, so you did a hard reboot.
  6. Did a successful ./launcher rebuild app
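
The launcher steps from the recap above can be sketched as a small dry-run script. This is only an illustration under stated assumptions: the run and recover_discourse helpers and the DRY_RUN and DISCOURSE_DIR variables are hypothetical names, and the Docker upgrade, the process kills and the hard reboot are left out because they depend on the host.

```shell
# Echo each command; execute it only when DRY_RUN=0 is set explicitly.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Replay the launcher steps from the recap.
# DISCOURSE_DIR defaults to /var/discourse, the path used in this thread.
recover_discourse() {
  cd "${DISCOURSE_DIR:-/var/discourse}" || return 1
  run ./launcher cleanup      # step 2: free up more space
  run ./launcher rebuild app  # step 6: rebuild and start the container
}
```

Calling recover_discourse with no variables set only prints the two commands; `DRY_RUN=0 recover_discourse` would actually run them.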

The culprit of all this mess was that someone had a perl script running on your box that was eating 100% CPU. No idea what it was; it totally hosed stuff and refused to be killed even with a kill -9.


@sam Thank you for the recap. I’ve linked to this thread for any campers who are curious about the outage.

If this ever happens again, I’ll remember to use ./launcher cleanup

I was scared to update docker in case there might be some breaking changes and you were stuck on an old version (I know very little about Docker).

Yes - it was crazy that I couldn’t kill that process. I tried everything, and rebooted several times and it kept coming back. Was it the runsv process? There were two of them running at one point eating up 90% CPU.