Unicorn using 100% cpu, can't kill -9


(Jeremy Howard) #1

The last 3 nights I’ve woken to find my forums down. The previous 2 nights it was out of memory (I’m using a Digital Ocean 2GB machine, but don’t have a very busy community so figured it would be plenty). This morning it still had 350MB free, but one unicorn process was stuck at 100% CPU. Attempting to kill -9 it had no effect.

I updated to the most recent Discourse version yesterday.

Can anyone provide some guidance on how to debug this problem? unicorn.stderr.log shows lots of lines like this:

E, [2017-04-08T14:17:55.783343 #65] ERROR -- : worker=2 PID:12278 timeout (14487s > 30s), killing
E, [2017-04-08T14:17:55.783449 #65] ERROR -- : worker=0 PID:12291 timeout (14487s > 30s), killing
E, [2017-04-08T14:17:56.785481 #65] ERROR -- : worker=1 PID:11719 timeout (14914s > 30s), killing
E, [2017-04-08T14:17:56.786330 #65] ERROR -- : worker=2 PID:12278 timeout (14488s > 30s), killing
E, [2017-04-08T14:17:56.786586 #65] ERROR -- : worker=0 PID:12291 timeout (14488s > 30s), killing
E, [2017-04-08T14:17:57.788099 #65] ERROR -- : worker=1 PID:11719 timeout (14915s > 30s), killing
E, [2017-04-08T14:17:57.788313 #65] ERROR -- : worker=2 PID:12278 timeout (14489s > 30s), killing
E, [2017-04-08T14:17:57.788362 #65] ERROR -- : worker=0 PID:12291 timeout (14489s > 30s), killing
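
For anyone reading a log like this: each line is the unicorn master reporting that a worker has not checked in for the stated number of seconds (against a 30s limit) and that it is trying to kill it. The same three PIDs repeat once a second, so it can help to collapse the log to one row per worker. The following is only a sketch; the function name summarize_timeouts is made up for illustration, and it assumes the line format shown above.

```shell
# Hypothetical helper: collapse unicorn "timeout ... killing" lines into one
# row per worker/PID, showing the longest stall reported (in seconds).
# Reads log text on stdin; lines in any other format are ignored.
summarize_timeouts() {
  sed -n 's/.*worker=\([0-9]*\) PID:\([0-9]*\) timeout (\([0-9]*\)s.*/\1 \2 \3/p' |
    awk '{ k = $1 " " $2; if ($3 + 0 > max[k]) max[k] = $3 + 0 }
         END { for (k in max) print k, max[k] }' |
    sort -n
}
```

Run as `summarize_timeouts < unicorn.stderr.log`. The columns are worker number, PID, and longest stall. For the excerpt above the stalls are over 14,000 seconds, i.e. the workers have been unresponsive for roughly four hours.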

(Sam Saffron) #2

Did you do a rebuild recently? Are you on latest docker?


(Jeremy Howard) #3

When I updated I did

cd /var/discourse && git pull && ./launcher rebuild app

I haven’t done anything to manually update docker - I don’t know whether the steps above do that automatically however…

I just checked - ‘apt upgrade’ showed an update for docker-engine was needed. I just did that now.


(Sam Saffron) #4

How is disk space doing? How is memory?


(Jeremy Howard) #5

As I mentioned, I had 350MB free out of 2GB (and have now just increased my RAM to 4GB). Disk space is >50% free.


(Matt Palmer) #6

A process that can’t be kill -9'd is a rare and anomalous beast indeed. It most commonly indicates some sort of uninterruptible I/O. strace might show what I/O is being done, and an lsof should at least show which files could be involved. Are you, by any chance, using DO’s remote block storage product? If not, I’m inclined to think that the physical machine you’re on could be having some serious aneurysms.
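
Before reaching for strace or lsof, a quick first check is the state the kernel reports for the process. The following is a rough sketch for Linux only; the function names proc_state and describe_state are made up for illustration, and the PID must be the one visible from wherever the command is run (inside a container the numbers can differ from the host's).

```shell
# Hypothetical helper: print the kernel's one-letter state code for a PID.
# Linux only; reads the State: line of /proc/<pid>/status.
proc_state() {
  awk '/^State:/ { print $2; exit }' "/proc/$1/status"
}

# Hypothetical helper: say what a state code implies for kill -9.
describe_state() {
  case "$1" in
    D) echo "uninterruptible sleep: signals usually wait until the kernel call returns" ;;
    Z) echo "zombie: already dead, waiting for its parent to reap it" ;;
    R) echo "running: on CPU or runnable" ;;
    S) echo "interruptible sleep: should respond to signals" ;;
    *) echo "other state: $1" ;;
  esac
}
```

For example, `describe_state "$(proc_state 12278)"`, using a PID from the log above. If the answer is D, `strace -p 12278` and `lsof -p 12278` are the next things to look at, as suggested.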


(Jeremy Howard) #7

Thanks @mpalmer. I’m not using remote block storage. If it happens again I’ll try launching a new instance. I guess that I should just follow the steps in “Move your Discourse Instance to a Different Server” to do that…


(Jeremy Howard) #8

I moved over to a new server yesterday and got through the night successfully. Fingers crossed…