Nothing we can see in those logs.
Please check disk space, and if possible paste a log from a rebuild.
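A quick way to check both, as a sketch, assuming the standard Discourse Docker install at /var/discourse (adjust paths if yours differ):

```shell
# Free space on the root and Docker data partitions (common default locations)
df -h / /var/lib/docker

# Recent output from the app container, suitable for pasting into the thread
cd /var/discourse && ./launcher logs app | tail -n 100
```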
We have almost 10 gigs free. Is there some sort of swap file that would take up more than that?
I could go in and manually delete one or more of the backups (we keep 2 or 3 days backups, but we probably don’t need to since we also send them to s3). Do you know where these backups would be located?
By default at /var/discourse/shared/standalone/backups/default/
Thanks. The backups are only a few hundred megabytes.
I am trying to run ./launcher rebuild app again and it hangs on this:
/var/discourse# ./launcher rebuild app
Ensuring launcher is up to date
Fetching origin
Launcher is up-to-date
Stopping old container
+ /usr/bin/docker stop -t 10 app
Is this normal? How long does this usually take?
Less than 10 seconds. If you have a long-running transaction on the DB this can be problematic (we had this a few commits ago); you can log in, try to find the problematic PID, and kill it.
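One way to find that PID, as a sketch, assuming the discourse_docker defaults (container named app, database named discourse): query pg_stat_activity from the host. The pid 12345 below is a placeholder for whatever the query actually reports.

```shell
# List open transactions, longest-running first
docker exec app su postgres -c "psql discourse -c \"
  SELECT pid, state, now() - xact_start AS xact_age, left(query, 60) AS query
  FROM pg_stat_activity
  WHERE xact_start IS NOT NULL
  ORDER BY xact_age DESC;\""

# Terminate the offending backend by its pid (12345 is a placeholder)
docker exec app su postgres -c "psql discourse -c 'SELECT pg_terminate_backend(12345);'"
```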
I ran top and saw this:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1428 root 20 0 4236 364 292 R 98.1 0.0 13:36.66 runsv
1429 root 20 0 4236 368 292 R 94.8 0.0 13:32.93 runsv
7 root 20 0 0 0 0 S 0.3 0.0 0:02.48 rcu_sched
7729 root 20 0 21960 1540 1100 R 0.3 0.0 0:00.08 top
1 root 20 0 33496 2804 1444 S 0.0 0.1 0:01.98 init
Is there another way to start discourse other than the ./launcher rebuild app command?
Well, there’s ./launcher start app, which will just fire up the already-built container if it isn’t already running, but your container appears somewhat hosed, based on the consumption of a full core by each of two runsv processes. I’d jump straight into strace -p 1428 -f -s 10000, but I’m weird like that.
I restarted the droplet and now runsv isn’t running anymore, but it’s still not up and we’re still having the original problem with our logs.
I am having a look now…
perl is chewing 100% cpu on this machine …
Upgraded Docker and am rebooting the box; then I’ll have a look at CPU to confirm that whatever perl script you were running is properly gone.
@sam Awesome! Thank you for helping with this, and for keeping me posted. We’ve started the process of creating a new instance from a restore just in case the old instance is unrecoverable. (Though it may take hours to go through this - just want to minimize downtime.)
I need you to reboot that server from the digital ocean panel, that perl script caused super havoc
reboot via command line did not work
I’ve power cycled it.
cool … old image is nuked, did a cleanup … it is bootstrapping just fine
OK - so it isn’t hanging at the /usr/bin/docker stop -t 10 app step like it was before? Does this mean we won’t need to restore everything?
I think you should be fine, it is precompiling assets now, should be done in a minute or 2
https://forum.freecodecamp.com/ is up.
A recap of what I did:
./launcher cleanup (to free up more space)
./launcher rebuild app
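For reference, the recovery sequence above as run from /var/discourse (the standard install location; a sketch, not a full transcript of the session):

```shell
cd /var/discourse
./launcher cleanup        # frees disk space by removing stopped containers and old images
./launcher rebuild app    # rebuilds the app container from scratch and restarts it
```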
The culprit of all this mess was a perl script someone had running on your box that was eating 100% CPU. No idea what it was; it totally hosed things and refused to be killed, even with a kill -9
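As an aside: a process that survives kill -9 is usually either stuck in uninterruptible (D) state or being respawned by something else. A rough checklist, as a sketch (the [p]erl pattern keeps grep from matching itself):

```shell
# Find the process, its parent, and its state; STAT "D" means uninterruptible sleep,
# which is why SIGKILL appears to have no effect
ps -eo pid,ppid,stat,comm | grep '[p]erl'

# If the parent is a supervisor or cron, that is the respawner; common spots to check:
crontab -l 2>/dev/null
ls /etc/cron.d /etc/cron.hourly /etc/cron.daily 2>/dev/null
```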
@sam Thank you for the recap. I’ve linked to this thread for any campers who are curious about the outage.
If this ever happens again, I’ll remember to use ./launcher cleanup
I was scared to update docker in case there might be some breaking changes and you were stuck on an old version (I know very little about Docker).
Yes - it was crazy that I couldn’t kill that process. I tried everything, and rebooted several times, and it kept coming back. Was it the runsv process? There were two of them running at one point, eating up 90% CPU.