I meant to post this to @mperham on Twitter, but it does not really fit in a tweet. Quite a few times we have noticed a “stuck” Sidekiq: all the jobs in the queue appear to be stuck in a running state forever.
I managed to isolate what is causing it today and would like some tips on how to correct it. It appears the “identity” of a job’s owner is determined based on `hostname:PID`.

If you hang up a processor, Sidekiq “detects” that a `hostname:PID` is no longer around roughly 20-30 seconds after it is terminated. When that happens it will clear all the jobs it thought that `hostname:PID` owned.

However, in a Docker world we are able to do very quick restarts that very often come back with the same `hostname:PID`, before Sidekiq detects that the old job processor died. This sequence leaves us with jobs that appear to be running forever.
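To make the failure mode concrete, here is a minimal Ruby sketch of the kind of identity string involved, assuming it is built only from hostname and PID (the exact construction inside Sidekiq may differ):

```ruby
require "socket"

# In a container the hostname is typically the stable container name, and PIDs
# come from a fresh, small namespace, so a quickly restarted worker frequently
# gets the same PID (often PID 1) and therefore the same identity string as
# the process it replaced.
identity = "#{Socket.gethostname}:#{Process.pid}"
# e.g. "sidekiq-worker-1:1" both before and after a fast restart
```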
This issue has popped up many times for many customers and internally, but we never really picked up on the reason until now. I guess my questions are:
- Can we amend Sidekiq so it uses an additional piece of data for job processor identity, e.g. `hostname:PID:guid`? (A rough sketch of this idea follows the list.)
- Can we improve our shutdown sequence to cleanly clear the currently running jobs?
- Any other ideas? (A timeout can work here, but it will have a very delayed effect.)
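For the first question, here is a rough sketch of what the extra piece of identity could look like, assuming a per-boot random component is acceptable (the names are illustrative, not Sidekiq's actual internals):

```ruby
require "socket"
require "securerandom"

# Generated once per process boot, so a restarted worker can never collide
# with the identity of the process it replaced, even if hostname and PID match.
PROCESS_GUID = SecureRandom.hex(6)

def process_identity
  "#{Socket.gethostname}:#{Process.pid}:#{PROCESS_GUID}"
end

# e.g. "sidekiq-worker-1:1:a3f29c81d04b" before the restart
#  and "sidekiq-worker-1:1:7be0124cc95d" after it -- no longer the same owner
```

With something like this in place, the stale entry left behind by the old process would no longer match the new process, so the existing 20-30 second cleanup could reclaim its jobs as intended.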
Note: we cannot use any of the Sidekiq Pro or Enterprise features, as we need the fix to apply to all Discourse open source users.