Sidekiq is being paused, how can I discover why?

Hi everyone,

Over the last few weeks, my Discourse instance has stopped sending emails on three different occasions.

In each case, it was because Sidekiq was somehow paused.
I followed https://meta.discourse.org/t/sidekiq-stopped-working/26479, the queue was cleared, and all the emails were sent.

Interestingly enough, they all happened a few hours after an upgrade.

My question is: how can I discover what caused Sidekiq to be paused?

1 Like

This happened to me recently and was driving me nuts!

It also cost me several new user sign-ups … :angry:

Spot where I fixed it! :relieved:

[screenshot]

Use the Rails console to unpause Sidekiq. (This is only a temporary fix, though.)
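In case it helps anyone else, the unpause looks roughly like this (assuming the `Sidekiq.paused?`/`Sidekiq.unpause!` helpers that Discourse’s pausable Sidekiq extension adds; check lib/sidekiq/pausable.rb on your install):

```ruby
# On the host: cd /var/discourse && ./launcher enter app, then: rails c
# These helpers come from Discourse's pausable Sidekiq extension
# (lib/sidekiq/pausable.rb); double-check they exist on your version.
Sidekiq.paused?   # => true when Sidekiq is stuck in the paused state
Sidekiq.unpause!  # temporary fix: it will pause again on the next failing backup
```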

Check your logs; it’s likely that you have an error. I was seeing “Access Denied”-style messages. Check your S3 backup process if you have one. At some point an update must have started requiring additional permissions on Amazon S3; a broader S3 policy fixed it for me.

YMMV

3 Likes

Oooooooooh. That’s unexpected!

Yes, in fact I saw that message a few minutes ago while searching for any indication of why Sidekiq was paused, but it didn’t occur to me that it could be related.

I did fix that setting (I don’t want to delete files from S3) a couple of minutes before opening this topic, but I didn’t think much of it. When I imported the data to the new server, it appears some settings were lost.

I was going to keep an eye on it to make sure the error went away, but now I’m hopeful that it will fix the Sidekiq pausing as well!

Thanks

1 Like

Hmm, why would S3 problems cause Sidekiq to stop completely, @gerhard?

Just a guess: isn’t Sidekiq paused during a backup? (Perhaps understandably so, because the backup process takes up so many local compute resources.)

So if that job falls over, it’s never automatically unpaused?

3 Likes

Oh yes, right. It gets paused during a backup and never recovers. You’re exactly correct, and I remember us running into this before.

I wonder if we should have a safety mechanism where Sidekiq cannot be “paused for backup” for more than x hours, where x is maybe four? What do you think, @gerhard?

An arbitrary admin pause of Sidekiq should always be possible, but a “stuck forever for backup” pause doesn’t seem right to me.
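For illustration only, a time-limited pause could be as simple as a flag with a TTL. The key name, the four-hour limit, and the `Discourse.redis` handle here are all assumptions for the sketch, not how Discourse currently pauses Sidekiq:

```ruby
# Hypothetical sketch of a "paused for backup" flag that cannot outlive the
# backup by more than four hours. Not actual Discourse code.
PAUSE_KEY     = "sidekiq_paused_for_backup"
MAX_PAUSE_TTL = 4 * 60 * 60 # seconds

def pause_for_backup!
  # The TTL makes the flag expire on its own, so a backup job that dies
  # without cleaning up can only hold Sidekiq for four hours at most.
  Discourse.redis.setex(PAUSE_KEY, MAX_PAUSE_TTL, "1")
end

def paused_for_backup?
  Discourse.redis.get(PAUSE_KEY) == "1"
end

def unpause_after_backup!
  Discourse.redis.del(PAUSE_KEY)
end
```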

1 Like

I started following that lead, and it seems pretty promising.

I went to check my backups, and only the first one after the upgrade was in S3.
I couldn’t even start a new backup in Discourse; I had to cancel the stuck one several times (I’d cancel it, refresh the page, and cancel it again and again).

So it appears the timeline is more or less like this:

  • We upgrade
  • Within 24 hours, a backup runs and is uploaded to S3
  • Sidekiq is paused and never unpaused

Even after I unpaused Sidekiq, no new backups were created, which was unexpected.

For the record, this is the error I was getting for S3:

```
Access Denied
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/plugins/raise_response_errors.rb:15:in `call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/aws-sdk-s3-1.14.0/lib/aws-sdk-s3/plugins/sse_cpk.rb:22:in `call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/aws-sdk-s3-1.14.0/lib/aws-sdk-s3/plugins/dualstack.rb:26:in `call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/aws-sdk-s3-1.14.0/lib/aws-sdk-s3/plugins/accelerate.rb:35:in `call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/aws-sdk-core-3.21.2/lib/aws-sdk-core/plugins/jsonvalue_converter.rb:20:in `call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/aws-sdk-core-3.21.2/lib/aws-sdk-core/plugins/idempotency_token.rb:17:in `call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/aws-sdk-core-3.21.2/lib/aws-sdk-core/plugins/param_converter.rb:24:in `call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/aws-sdk-core-3.21.2/lib/aws-sdk-core/plugins/response_paging.rb:10:in `call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/plugins/response_target.rb:23:in `call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/request.rb:70:in `send_request'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/aws-sdk-s3-1.14.0/lib/aws-sdk-s3/client.rb:1248:in `delete_object'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/aws-sdk-s3-1.14.0/lib/aws-sdk-s3/object.rb:571:in `delete'
/var/www/discourse/lib/s3_helper.rb:42:in `remove'
/var/www/discourse/app/models/backup.rb:64:in `remove_from_s3'
/var/www/discourse/app/models/backup.rb:40:in `after_remove_hook'
/var/www/discourse/app/models/backup.rb:31:in `remove'
/var/www/discourse/app/models/backup.rb:89:in `each'
/var/www/discourse/app/models/backup.rb:89:in `remove_old'
/var/www/discourse/lib/backup_restore/backuper.rb:257:in `remove_old'
/var/www/discourse/lib/backup_restore/backuper.rb:59:in `run'
/var/www/discourse/lib/backup_restore/backup_restore.rb:16:in `backup!'
/var/www/discourse/app/jobs/regular/create_backup.rb:8:in `execute'
/var/www/discourse/app/jobs/base.rb:137:in `block (2 levels) in perform'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/rails_multisite-2.0.4/lib/rails_multisite/connection_management.rb:63:in `with_connection'
/var/www/discourse/app/jobs/base.rb:127:in `block in perform'
/var/www/discourse/app/jobs/base.rb:123:in `each'
/var/www/discourse/app/jobs/base.rb:123:in `perform'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:187:in `execute_job'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:169:in `block (2 levels) in process'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/middleware/chain.rb:128:in `block in invoke'
/var/www/discourse/lib/sidekiq/pausable.rb:81:in `call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/middleware/chain.rb:130:in `block in invoke'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/middleware/chain.rb:133:in `invoke'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:168:in `block in process'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:139:in `block (6 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/job_retry.rb:98:in `local'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:138:in `block (5 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq.rb:36:in `block in <module:Sidekiq>'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:134:in `block (4 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:199:in `stats'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:129:in `block (3 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/job_logger.rb:8:in `call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:128:in `block (2 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/job_retry.rb:73:in `global'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:127:in `block in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/logging.rb:48:in `with_context'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/logging.rb:42:in `with_job_hash_context'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:126:in `dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:167:in `process'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:85:in `process_one'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:73:in `run'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/util.rb:16:in `watchdog'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/util.rb:25:in `block in safe_thread'
```

I’m not entirely sure whether the automatic backup will work, but I managed to run three backups manually (and they are in S3).

Yes, because it halts other processing, like notifications and sign-up emails.

However, the really important thing, imho, is to bring this to the admin’s attention ASAP.

It was only because I navigated to Sidekiq that I found the issue. That eventually led me to check the logs etc.

The backup process fails to delete old backups from S3. Unfortunately, it crashes inside the ensure block, which prevents Sidekiq from being unpaused again. I’m going to fix that.

https://github.com/discourse/discourse/blob/master/lib/backup_restore/backuper.rb#L56-L63
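To make the failure mode concrete, here is a stripped-down, runnable sketch; the helper names are made up and this is not the actual Backuper code, it just shows why an exception inside ensure leaves Sidekiq paused:

```ruby
# Illustrative only: an exception raised inside `ensure` aborts the rest of
# the ensure block, so anything after it (like the unpause) never runs.
def pause_sidekiq;   puts "Sidekiq paused";   end
def unpause_sidekiq; puts "Sidekiq unpaused"; end
def remove_old;      raise "Access Denied";   end # stands in for the failing S3 delete

def run_backup
  pause_sidekiq
  puts "dump database, upload to S3"
ensure
  remove_old       # raises here, aborting the rest of the ensure block...
  unpause_sidekiq  # ...so this line is never reached and Sidekiq stays paused
end

run_backup rescue puts "backup job crashed; Sidekiq is still paused"
```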

6 Likes

I agree.
If a backup fails (even partially, by failing to delete old files), I do not expect every other background process to be silently paused. I do expect admins to be somehow notified.

I only went to Sidekiq because some support topics mentioned it; I didn’t even know my backups weren’t working.

1 Like

There is a warning on the admin dashboard for the rare case that Sidekiq has stopped working.

But I agree that a failed backup shouldn’t prevent Sidekiq from running. It will be fixed!

4 Likes

Thanks! That’s weird, because I did not see that message, and I looked at the dashboard dozens of times while Sidekiq was paused. Perhaps it was just me …

Good to know this feature is there, though!

Appreciate the bug fix, many thanks!

@gerhard I just paused my Sidekiq using rails c and I don’t see this warning … why might that be?

5 Likes

The warning only appears when there are jobs enqueued and the last job was executed more than 2 minutes ago.

https://github.com/discourse/discourse/blob/7f420b61cb8ff01f4ced18acf6d542f177f5ff37/app/models/admin_dashboard_data.rb#L176-L179
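In other words, the check is roughly equivalent to this (a simplified paraphrase, not a verbatim copy of the linked code; the helper and translation-key names may differ between versions):

```ruby
# Simplified paraphrase of the dashboard's Sidekiq check; names are
# approximate and may not match the current codebase exactly.
def sidekiq_check
  last_job_performed_at = Jobs.last_job_performed_at
  if Jobs.queued > 0 && (last_job_performed_at.nil? || last_job_performed_at < 2.minutes.ago)
    I18n.t('dashboard.sidekiq_warning')
  end
end
```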

2 Likes

Thanks for the clarification! :slight_smile:

I saw the same thing as @merefield. Sidekiq itself showed that big banner, but there was no banner in the admin dashboard.

I don’t have a screenshot, but from memory I had a huge number in scheduled (a few thousand), while enqueued was still zero.

4 Likes

We clearly need a better way of triggering this, @gerhard.

1 Like

I made some changes to the backup and restore process so that it always unpauses Sidekiq.
https://github.com/discourse/discourse/commit/469a2c36edf63923eadcb2b673f734bc592b81cd
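One way to guarantee that, sketched here with made-up helper names (the actual commit may differ): rescue the cleanup step on its own, so a failure there can no longer skip the unpause at the end.

```ruby
# Illustrative, runnable sketch of the idea only – not the real Backuper code.
def pause_sidekiq;   puts "Sidekiq paused";   end
def unpause_sidekiq; puts "Sidekiq unpaused"; end
def remove_old;      raise "Access Denied";   end

def run_backup
  pause_sidekiq
  puts "dump database, upload to S3"
ensure
  begin
    remove_old                 # the S3 error is now contained here...
  rescue => ex
    puts "Failed to remove old backups: #{ex.message}"
  end
  unpause_sidekiq              # ...so the unpause is always reached
end

run_backup
```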

You remembered correctly. There’s no warning in the dashboard when Sidekiq is paused, since jobs are scheduled but not enqueued. The warning appears only when Sidekiq isn’t running at all.

I could add an additional warning, but I don’t think it’s needed. Sidekiq should now always be unpaused – no matter what.

5 Likes