Sidekiq dying and not coming back up

I’ve got a bit of a mysterious issue going on with my sidekiq processes: they keep dying and they don’t come back up.

The monitoring of sidekiq processes that @sam talks about in "Running Sidekiq more efficiently and reliably" does seem to be working, sometimes.

The unicorn stderr logs are filled with Detected dead worker <xxxx>, restarting..., averaging about 40 a day, and I’ve witnessed the process come back up after one of these log messages, so this is working (at least sometimes).

There’s the occasional worker timeout (like ERROR -- : worker=0 PID:13069 timeout (31s > 30s), killing), and a lot of Passing 'connection' command to redis as is; blind passthrough has been deprecated and will be removed in redis-namespace 2.0 (at /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.2.5/lib/sidekiq/web/helpers.rb:152:in `block in redis_connection'), but I can’t see that these would bring sidekiq down.

And when I say sidekiq is down, I mean it’s totally dead, it’s an ex-job scheduler. The process doesn’t exist.

Each of our instances is running with 1.7GB of memory allocated, so this shouldn’t be an OOM problem.

For the time being I’ve left Discourse running with only one sidekiq process so I know as soon as it dies, and I can have a trawl through the logs for anything that might be causing it.

Has anyone got any ideas what might be causing this, or any logfiles I should be looking at which I might not be?

EDIT: tmp/pids/sidekiq_0.pid does exist (but the process it points to doesn’t)
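
For anyone wanting to check for the same symptom (pid file present, process gone), here is a rough sketch in Python. It assumes a POSIX host and a pid file containing nothing but the pid as text; the helper names `pid_is_running` and `pidfile_is_stale` are made up for illustration and are not part of Discourse or Sidekiq.

```python
import os
from pathlib import Path


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this pid currently exists (POSIX only)."""
    try:
        # Signal 0 sends nothing; it only checks that the process can be signalled.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but is owned by another user.
        return True
    return True


def pidfile_is_stale(pidfile: str) -> bool:
    """Return True if the pid file exists but points at a process that is gone."""
    path = Path(pidfile)
    if not path.exists():
        return False
    # Assumption: the file holds just the pid, optionally followed by a newline.
    pid = int(path.read_text().strip())
    return not pid_is_running(pid)
```

Running `pidfile_is_stale("tmp/pids/sidekiq_0.pid")` from the Discourse directory would then report whether the file is left over from a dead process.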

Have you identified a job that is never finishing?

Maybe enabling the new “DISCOURSE_LOG_SIDEKIQ” feature that @david added recently would help?


The new log might help. If you’re trying to track down a never-ending job, add DISCOURSE_LOG_SIDEKIQ_INTERVAL=1 to log in-progress jobs every second. If you have any suggestions for improving the log please do let me know @LeoMcA


Thanks for the comments, both - I imagine this new log will help a lot!

I’ve currently got it logging active jobs every minute, but I guess I’ll have to be really lucky to find the job that kills sidekiq with it set that high (or the job will have to be running a really long time).

@david what performance impact are we looking at when logging every second? I suppose all we’d hang is the sidekiq thread, which is hardly a problem when it’s not working properly anyway.

Log looks great, very nice to see it as JSON. My plan is to convert that to CSV, so I can open it as a spreadsheet and filter by the pid of the sidekiq process which died when that happens.
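
The JSON-to-CSV step could look roughly like the sketch below. It assumes only that each log line is one JSON object; the function name `json_lines_to_csv` is my own, and the column names are taken from whatever keys turn up in the log rather than from any documented schema.

```python
import csv
import json
from typing import Iterable, TextIO


def json_lines_to_csv(lines: Iterable[str], out: TextIO) -> int:
    """Write one CSV row per JSON-object line; return the number of rows written."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        if isinstance(record, dict):
            rows.append(record)

    # Use the union of all keys, in first-seen order, as the CSV header.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)  # keys missing from a row are left as empty cells
    return len(rows)
```

Something like `json_lines_to_csv(open("sidekiq.log"), open("sidekiq.csv", "w", newline=""))` would give a file that opens in a spreadsheet, ready to filter on whichever column holds the pid.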

Perhaps for casual log-viewing, separate files for the finished and pending jobs might be useful, but I don’t think anyone’s likely to be viewing this log without some time on their hands to grep through it.


In my testing it hasn’t caused any noticeable impact. In theory it could cause a very slight performance hit on the Sidekiq process. The impact should be fairly minimal because the logging disk i/o is performed in a separate thread, allowing other things to continue in the main worker thread.

Each JSON object is on its own line, so you can simply do something like

grep pending sidekiq.log

And then parse the JSON that you get out of that.
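
As a rough illustration of the parsing step, here is a Python sketch that does the same "pending" filter and then decodes each line. The function name `pending_records` is hypothetical, and the `pid` key is an assumption about the log's field names, so check it against a real line from your log first.

```python
import json
from typing import Iterable, Iterator, Optional


def pending_records(
    lines: Iterable[str],
    pid: Optional[int] = None,
    pid_key: str = "pid",  # assumed field name, not confirmed against the real log
) -> Iterator[dict]:
    """Yield parsed JSON objects from lines containing 'pending', optionally for one pid."""
    for line in lines:
        if "pending" not in line:
            continue  # same effect as the grep filter
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        # Compare as strings so it works whether the log stores the pid as a number or text.
        if pid is not None and str(record.get(pid_key)) != str(pid):
            continue
        yield record
```

For example, `list(pending_records(open("sidekiq.log"), pid=13069))` would collect the pending entries for one process, using the pid of whichever Sidekiq process died.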
