Help! Discourse workers dying on our community site


(Dan Cunningham) #1

Hi, I’m a member of the openHAB team; we are one of the largest open source home automation projects. We are also affiliated with the Eclipse Smart Home project, which we contribute to and use as the base of our own project (think of us as the reference implementation).

Our community forums have been down for the last 24 hours and we are struggling to get them back online.

The Issue

Unicorn worker processes are being killed after exceeding their 30-second timeout; we see this in shared/standalone/log/rails/unicorn.stderr.log:

E, [2017-09-21T14:29:18.466973 #261] ERROR -- : worker=0 PID:13791 timeout (43s > 30s), killing
E, [2017-09-21T14:29:18.467612 #261] ERROR -- : worker=2 PID:13823 timeout (36s > 30s), killing
E, [2017-09-21T14:29:18.467874 #261] ERROR -- : worker=1 PID:14395 timeout (33s > 30s), killing
E, [2017-09-21T14:29:18.547036 #261] ERROR -- : worker=0 PID:13791 timeout (43s > 30s), killing
E, [2017-09-21T14:29:18.547289 #261] ERROR -- : worker=1 PID:14395 timeout (33s > 30s), killing

Once this starts happening, we have to restart the container or it continues indefinitely. Most times it takes more than one restart to get the site functional again.
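Before restarting again, it may be worth capturing what the host is doing at the moment the kills start; worker timeouts like this are often a symptom of memory or CPU starvation. A minimal triage sketch using only standard Linux tools (nothing here is Discourse-specific, and the thresholds to look for are your own):

```shell
# Run on the Droplet while workers are being killed.
free -m                           # memory: is swap being used heavily?
uptime                            # load average vs. number of cores
ps aux --sort=-%mem | head -n 10  # current top memory consumers
```

If swap is in heavy use or the load average is well above the core count, the timeouts are a side effect rather than the root cause.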

We doubled the size of our hardware two days ago, and now we cannot keep the site up for more than a few minutes.

What we have done so far
For the last couple of weeks this has been happening more frequently, and we have needed to restart the container to bring the site back online.

We were using a Digital Ocean 2 CPU / 4 GB instance and decided to double that, hoping that more memory and CPU would help handle the load.

After upgrading to 4 CPU / 8 GB and rebuilding our container with an appropriate app.yml, our problem seems to be worse, not better. We still get the exact same errors, but now the site can’t seem to stay up for more than a few minutes at a time. Our current settings look like the following:

db_shared_buffers: "2GB"
db_work_mem: "40MB"
UNICORN_WORKERS: 4
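For anyone comparing notes: in a standard Discourse containers/app.yml these three knobs live in two different sections, with the database settings under params: and the worker count under env:. A sketch of the relevant fragment (assuming the stock docker layout, not copied verbatim from our file):

```yaml
# Fragment of containers/app.yml (standard Discourse docker layout).
params:
  db_shared_buffers: "2GB"   # PostgreSQL shared_buffers
  db_work_mem: "40MB"        # PostgreSQL work_mem

env:
  UNICORN_WORKERS: 4         # number of unicorn worker processes
```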

I have also tried increasing the shared buffers, the number of workers, and a combination of both, but nothing has seemed to help. Restarting no longer helps either.

git log shows we are running revision ee9a607677106603ace3e90f3c5bd9ca84a11ef7, which was committed on Fri Sep 15 08:27:27 2017 +0800.

Next Steps
We are currently out of ideas after doubling the hardware. I’m not sure why the unicorn processes are hanging and then being killed under load, and why doubling the hardware has made things worse, not better.

Any help from the community would be very much appreciated.

Thanks
Dan-


(Jay Pfaffman) #2

Any plugins installed?


(Dan Cunningham) #3

I can’t log in to the web interface, but this is what is listed in the plugins directory.

/var/www/discourse/plugins# ls
discourse-details	   discourse-nginx-performance-report	 discourse-solved   lazyYT
discourse-github-linkback  discourse-plugin-code-fences-buttons  discourse-tagging  plugins
discourse-narrative-bot    discourse-presence	

(Dan Cunningham) #4

Sorry, I’m still new to Discourse admin; here is what is in our app.yml:

          - git clone https://github.com/discourse/docker_manager.git
          - git clone https://github.com/discourse/discourse-tagging.git
          - git clone https://github.com/discourse/discourse-solved.git
          - git clone https://github.com/ThomDietrich/discourse-plugin-code-fences-buttons.git
          - git clone https://github.com/discourse/discourse-github-linkback.git
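For context, those clone lines only mean something inside the hooks section of app.yml; a sketch of how they typically sit in a stock Discourse install (the surrounding structure here is the standard layout, not copied from our file):

```yaml
# Standard plugin hook structure in containers/app.yml.
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://github.com/discourse/docker_manager.git
          - git clone https://github.com/discourse/discourse-tagging.git
          - git clone https://github.com/discourse/discourse-solved.git
          - git clone https://github.com/ThomDietrich/discourse-plugin-code-fences-buttons.git
          - git clone https://github.com/discourse/discourse-github-linkback.git
```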

(Dan Cunningham) #5

We have disabled all plugins, but the problem remains.


(Jeff Atwood) #6

@sam saw this recently with corrupt databases.


(Dan Cunningham) #7

Thank you all for your responses! Just an FYI: I was in the app looking at the DB (thanks @codinghorror for the link), so I had done a sv stop unicorn, and when I was done I did a sv start unicorn. When the unicorn server started back up, our community site came back and now does not have any issues; it’s very snappy, to boot! I had run VACUUM FULL on the DB yesterday and it did not seem to help (or do anything, which didn’t surprise me, as autovacuum is on), but I ran it again between the stop and start. At this point I’m not sure what happened to make it come back.
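For anyone landing on this thread later, the sequence I ran was roughly the following. This is a hedged sketch assuming a standard Discourse docker install; the exact vacuumdb invocation is my reconstruction (I ran VACUUM FULL from psql), and note that stopping unicorn takes the site offline while it runs:

```shell
# On the host (standard /var/discourse install assumed):
cd /var/discourse
./launcher enter app    # open a shell inside the running container

# Inside the container:
sv stop unicorn         # stop the unicorn master and its workers
su postgres -c 'vacuumdb --full --analyze discourse'  # VACUUM FULL the discourse DB
sv start unicorn        # bring the workers back up
```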


(Jay Pfaffman) #8

The system wasn’t exhibiting the signs that @sam described.

du postgres_data shows 9GB, so maybe the database is just too big? But it was working fine on a 4GB droplet before this problem started.

Load average is staying below 2.

Maybe it’s bad luck and that server at Digital Ocean is broken in some way.

@digitaldan, you might file a ticket with them and see if they know of anything, like a failing SSD. Or some other user on that physical machine could be causing the problem.