I’ve got a bit of a mysterious issue going on with my sidekiq processes: they keep dying and they don’t come back up.
The sidekiq process monitoring that @sam describes here https://meta.discourse.org/t/running-sidekiq-more-efficiently-and-reliably/15001 does seem to be working, at least some of the time.
The unicorn stderr logs are filled with "Detected dead worker <xxxx>, restarting..." messages, averaging about 40 a day, and I've witnessed the process come back up after one of these, so this is working (at least sometimes).
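For context, my rough mental model of that check is something like this (just a sketch, assuming sidekiq periodically writes a heartbeat timestamp to redis; the key name and threshold below are made up, this is not the actual Discourse code):

```ruby
# Sketch of a heartbeat-style liveness check: the worker writes a timestamp
# to redis every so often, and the unicorn master restarts it when that
# timestamp goes stale. Key name and threshold are hypothetical.
require "redis"

HEARTBEAT_KEY = "sidekiq_heartbeat"  # hypothetical key
MAX_SILENCE   = 180                  # seconds of silence before we call it dead

redis = Redis.new
last_beat = redis.get(HEARTBEAT_KEY).to_i

if Time.now.to_i - last_beat > MAX_SILENCE
  puts "sidekiq heartbeat is stale, restarting worker..."
  # the master would kill and re-fork the sidekiq process here
end
```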
There's the occasional worker timeout, like "ERROR -- : worker=0 PID:13069 timeout (31s > 30s), killing", and a lot of "Passing 'connection' command to redis as is; blind passthrough has been deprecated and will be removed in redis-namespace 2.0" (at /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.2.5/lib/sidekiq/web/helpers.rb:152:in `block in redis_connection'), but I can't see how either of these would bring sidekiq down.
And when I say sidekiq is down, I mean it's totally dead; it's an ex-job scheduler. The process doesn't exist.
Each of our instances is running with 1.7GB of memory allocated, so this shouldn’t be an OOM problem.
For the time being I've left Discourse running with only one sidekiq process, so I'll know as soon as it dies and can trawl through the logs for anything that might be causing it.
Has anyone got any ideas about what might be causing this, or suggestions for log files I should be looking at that I might have missed?
tmp/pids/sidekiq_0.pid does still exist, but the process it points to doesn't.
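To be concrete, the kind of check I mean is something like this from a console (signal 0 only tests whether the process exists, it doesn't actually send anything):

```ruby
# Quick check that the pid file is stale.
pid = File.read("tmp/pids/sidekiq_0.pid").to_i

begin
  Process.kill(0, pid)
  puts "process #{pid} is still running"
rescue Errno::ESRCH
  puts "process #{pid} is gone, the pid file is stale"
end
```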