Hi, I’m a member of the openHAB team; we are one of the largest open source home automation projects. We are also affiliated with the Eclipse Smart Home project, which we contribute to and use as the base of our own project (think of us as the reference implementation).
Our community forums have been down for the last 24 hours and we are struggling to get them back online.
The Issue
Unicorn processes seem to be killed after 30 seconds; we see this in shared/standalone/log/rails/unicorn.stderr.log:
E, [2017-09-21T14:29:18.466973 #261] ERROR -- : worker=0 PID:13791 timeout (43s > 30s), killing
E, [2017-09-21T14:29:18.467612 #261] ERROR -- : worker=2 PID:13823 timeout (36s > 30s), killing
E, [2017-09-21T14:29:18.467874 #261] ERROR -- : worker=1 PID:14395 timeout (33s > 30s), killing
E, [2017-09-21T14:29:18.547036 #261] ERROR -- : worker=0 PID:13791 timeout (43s > 30s), killing
E, [2017-09-21T14:29:18.547289 #261] ERROR -- : worker=1 PID:14395 timeout (33s > 30s), killing
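For anyone who wants to look at the same thing, we watch that log from the host with tail; this assumes the standard /var/discourse install directory of the Docker-based setup, so adjust the path if yours differs:

cd /var/discourse
tail -f shared/standalone/log/rails/unicorn.stderr.log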
Once this starts happening, we have to restart the container or the killing continues indefinitely. Most of the time it takes more than one restart to get things functional again.
We doubled the size of our hardware two days ago, and now we cannot keep the site up for more than a few minutes.
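For reference, the restarts mentioned above are plain container restarts via the launcher script (again assuming the standard /var/discourse path):

cd /var/discourse
./launcher restart app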
What we have done so far
For the last couple of weeks this has been happening more frequently, and we have needed to restart the container to bring the site back online.
We were using a DigitalOcean 2 CPU / 4 GB instance and made the decision to double that, hoping that more memory and CPU would help handle the load.
After upgrading to 4 CPU / 8 GB of memory and rebuilding our container with an appropriate app.yml, our problem seems to be worse now, not better. We still have the exact same errors, but now our site can’t seem to stay up for more than a few minutes at a time. Our current settings look like the following:
db_shared_buffers: "2GB"
db_work_mem: "40MB"
UNICORN_WORKERS: 4
I have also tried increasing the shared buffers, the workers, and a combination of both, but nothing has seemed to help. Restarting no longer helps either.
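Each of those changes was applied by editing app.yml and doing a full rebuild, roughly like this (standard install path assumed):

cd /var/discourse
nano containers/app.yml      # adjust db_shared_buffers / db_work_mem / UNICORN_WORKERS
./launcher rebuild app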
git log shows we are running revision ee9a607677106603ace3e90f3c5bd9ca84a11ef7, which was committed on Fri Sep 15 08:27:27 2017 +0800.
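That revision was read from git inside the running container; as far as I can tell the app lives at the usual /var/www/discourse path in the image, so the steps were roughly:

cd /var/discourse
./launcher enter app
cd /var/www/discourse
git log -1 --format='%H %cd'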
Next Steps
We are currently out of ideas after doubling the hardware. I’m not sure why the Unicorn processes are hanging and then being killed under load, or why doubling the hardware has made things worse, not better.
Any help from the community would be very, very much appreciated.
Thanks
Dan-