Emergency help - freeCodeCamp Discourse down

ossia · 22 February 2017, 01:03

Our server was showing some users 500 errors, so I went to update Discourse to the latest version (I hadn’t updated in a month or so).

The update failed. It stopped for about 20 or 30 minutes when installing a Redis gem, so I tried SSH’ing in and pulling master from discourse-docker.

After doing that, I ran the launcher. It had this /usr/bin/docker stop -t 10 app command that I think was making the server wait for 10 minutes before attempting to relaunch. I didn’t want our forum to be down for a full 10 minutes, so I just power cycled our DigitalOcean droplet, which in the past has brought the forum back up fine. Unfortunately, it didn’t work this time. Here is part of the log:

https://gist.github.com/QuincyLarson/995f51e047e7e4188531f8a658faf99e

Any ideas what could be causing this?

Falco · 22 February 2017, 01:09

Nothing we can see in those logs.

Please check disk space, and if possible paste a log from a rebuild.
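For readers following along, here is a minimal sketch of a disk space check. The helper names `free_gib` and `has_free_gib` are hypothetical and not part of Discourse or its launcher; they only parse the POSIX-format output of `df -Pk`, where the fourth column is the available space in 1K blocks.

```shell
# Hypothetical helper: reads `df -Pk` output on stdin and prints the
# available space of the first listed filesystem in whole GiB.
free_gib() {
    # Line 1 is the header; column 4 of line 2 is "Available" in 1K blocks.
    awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Hypothetical helper: succeeds if the df output on stdin shows at
# least $1 GiB available.
has_free_gib() {
    [ "$(free_gib)" -ge "$1" ]
}
```

It could be used as `df -Pk /var/discourse | free_gib`, assuming Discourse is installed under /var/discourse as in this thread. How much free space a rebuild needs is not stated here, so any threshold passed to `has_free_gib` is your own choice.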

ossia · 22 February 2017, 01:11

We have almost 10 gigs free. Is there some sort of swap file that would take up more than that?

I could go in and manually delete one or more of the backups (we keep 2 or 3 days of backups, but we probably don’t need to since we also send them to S3). Do you know where these backups would be located?

Falco · 22 February 2017, 01:16

By default at /var/discourse/shared/standalone/backups/default/

ossia · 22 February 2017, 01:21

Thanks. The backups are only a few hundred megabytes.

I am trying to run ./launcher rebuild app again and it hangs on this:

/var/discourse# ./launcher rebuild app
Ensuring launcher is up to date
Fetching origin
Launcher is up-to-date
Stopping old container
+ /usr/bin/docker stop -t 10 app

Is this normal? How long does this usually take?

Falco · 22 February 2017, 01:22

Less than 10 seconds. If you have a long-running transaction on the DB this can be problematic (we had this a few commits ago). You can log in and try to find the problematic PID and kill it.

ossia · 22 February 2017, 01:24

I ran top and saw this:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                      
 1428 root      20   0    4236    364    292 R  98.1  0.0  13:36.66 runsv                                                        
 1429 root      20   0    4236    368    292 R  94.8  0.0  13:32.93 runsv                                                        
    7 root      20   0       0      0      0 S   0.3  0.0   0:02.48 rcu_sched                                                    
 7729 root      20   0   21960   1540   1100 R   0.3  0.0   0:00.08 top                                                          
    1 root      20   0   33496   2804   1444 S   0.0  0.1   0:01.98 init

Is there another way to start discourse other than the ./launcher rebuild app command?

mpalmer · 22 February 2017, 01:35

Well, there’s ./launcher start app, which will just fire up the already-built container if it isn’t already running, but your container appears somewhat hosed, based on the consumption of a full core by each of two runsv processes. I’d jump straight into strace -p 1428 -f -s 10000, but I’m weird like that.
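As a rough, non-interactive alternative to eyeballing top, the sketch below filters a process listing for anything at or above a CPU threshold. The helper name `busy_procs` is hypothetical; it expects the three-column output of `ps -eo pid,pcpu,comm` on stdin. Note that ps reports CPU usage averaged over the life of the process, so its numbers can differ from the live figures top shows.

```shell
# Hypothetical helper: reads `ps -eo pid,pcpu,comm` output on stdin and
# prints "PID COMMAND" for every process at or above the given %CPU.
busy_procs() {
    # NR > 1 skips the header line; "+ 0" forces a numeric comparison.
    awk -v min="$1" 'NR > 1 && $2 + 0 >= min + 0 { print $1, $3 }'
}
```

For example, `ps -eo pid,pcpu,comm | busy_procs 90` would list candidates like the two runsv processes above, whose PIDs could then be passed to strace -p.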

ossia · 22 February 2017, 01:49

I restarted the droplet and now runsv isn’t running anymore, but it’s still not up and we’re still having the original problem with our logs.

sam · 22 February 2017, 02:02

I am having a look now…

sam · 22 February 2017, 02:08

perl is chewing 100% cpu on this machine …

upgraded docker and trying to reboot the box, then will have a look at cpu to make sure that whatever perl script you were running is really no longer running.

ossia · 22 February 2017, 02:19

@sam Awesome! Thank you for helping with this, and for keeping me posted. We’ve started the process of creating a new instance from a restore just in case the old instance is unrecoverable (though it may take hours to go through this, we just want to minimize downtime).

sam · 22 February 2017, 02:20

I need you to reboot that server from the DigitalOcean panel; that perl script caused super havoc

reboot via command line did not work

ossia · 22 February 2017, 02:21

I’ve power cycled it.

sam · 22 February 2017, 02:23

cool … old image is nuked, did a cleanup … it is bootstrapping just fine

ossia · 22 February 2017, 02:25

OK - so it isn’t hanging at the /usr/bin/docker stop -t 10 app like it was before? Does this mean we won’t need to restore everything?

sam · 22 February 2017, 02:26

I think you should be fine, it is precompiling assets now, should be done in a minute or 2

sam · 22 February 2017, 02:34

https://forum.freecodecamp.com/ is up.

A recap on what I did.

1. Upgraded docker.
2. Ran ./launcher cleanup to free up more space.
3. Attempted to kill the container and had no luck: docker stop app was not working, and the machine was very hosed and eating up 100% cpu.
4. Attempted to kill the runaway perl script you had running; no idea where it came from.
5. Tried rebooting from the command line and had no luck, so you did a hard reboot.
6. Did a successful ./launcher rebuild app.
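The launcher part of this recap can be sketched as a small script. This is only an illustration under stated assumptions: Discourse lives in /var/discourse, the container is named app, and docker has already been upgraded and the box rebooted by other means. The `run` and `recover_discourse` names and the `DRY_RUN` and `DISCOURSE_DIR` variables are hypothetical, not part of the launcher.

```shell
# Assumption: Discourse is installed in /var/discourse, as in this thread.
DISCOURSE_DIR="${DISCOURSE_DIR:-/var/discourse}"

# Hypothetical helper: echo the command, and execute it only when
# DRY_RUN is not set to 1.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# Hypothetical wrapper around the two launcher steps from the recap.
recover_discourse() {
    run cd "$DISCOURSE_DIR" || return 1
    run ./launcher cleanup || return 1   # free up disk space first
    run ./launcher rebuild app           # then rebuild the "app" container
}
```

Setting `DRY_RUN=1` before calling `recover_discourse` only prints the commands, which is a safe way to review the sequence before running it for real.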

The culprit of all this mess was that someone had a perl script running on your box that was eating 100% cpu. No idea what it was; it totally hosed stuff and refused to be killed even with a kill -9.

ossia · 22 February 2017, 02:43

@sam Thank you for the recap. I’ve linked to this thread for any campers who are curious about the outage.

If this ever happens again, I’ll remember to use ./launcher cleanup

I was scared to update docker in case there might be some breaking changes and you were stuck on an old version (I know very little about Docker).

Yes - it was crazy that I couldn’t kill that process. I tried everything, and rebooted several times, and it kept coming back. Was it the runsv process? There were two of them running at one point, each eating up more than 90% CPU.
