Sidekiq is unexpectedly paused

We have webhooks set up for both Topic Events and Post Events, and they were working fine.
Then they suddenly stopped working: no webhook is sent on any Topic Event or Post Event.

We have also checked the Discourse logs, but there are no errors.

So is there any way we can troubleshoot this issue?

Check the Sidekiq queue. Is it filled up with posts being rebaked due to the recent image changes?

Thanks for your reply.
I found all my webhook events in the Scheduled list in Sidekiq.

Can you please help me with the following?

  1. How do I fix it?
  2. What does the scheduled list mean, and how can I execute those webhooks now?

Finally found the issue.
Sidekiq was paused.

Why was it paused? Did you pause it?

Not sure, but maybe it’s because a backup failed.

Make sure you have disk space and reboot.

What happens is sidekiq is paused during backup, but if backup fails, sidekiq is never unpaused. Didn’t you make a bunch of improvements here recently @tgxworld?

Are you on the latest version of Discourse @mhr?

We’ll always unpause Sidekiq even if the backup/restore fails.

We are currently using v2.1.0.beta6 +89.

You need to update to latest.

This “pause sidekiq during backups” decision is the source of considerable trauma @tgxworld. Especially for giant backups going to slow offsite network dumps, or if something goes wrong with the backup and it never “finishes”, etc… users don’t get notifications and so on because Sidekiq is paused, and the longer this goes on… the worse the effects are on users.

Why did we make this decision to pause Sidekiq during backups? Do you remember @sam?

One reason is that if sidekiq is running, the backup can fail because an upload gets deleted when tar is running. But maybe having a backup fail isn’t that bad.

Before this fix, backups would often fail due to transaction deadlocks with Sidekiq jobs.

The intention here was always to simply pause Sidekiq, run pg_dump, unpause Sidekiq, create the tar, compress the tar, and upload the backup.
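To make that ordering concrete, here is a minimal Ruby sketch of the flow, not the actual Discourse backuper code. `FakeQueue`, `run_backup`, and the step names are hypothetical stand-ins invented for illustration; the point is only that an `ensure` block unpauses the queue even when the dump step raises.

```ruby
# Hypothetical stand-in for the job processor; it only tracks a paused flag.
class FakeQueue
  def initialize
    @paused = false
  end

  def pause!
    @paused = true
  end

  def unpause!
    @paused = false
  end

  def paused?
    @paused
  end
end

# Runs the steps in the order described above. `steps` is a hash of
# callables with hypothetical keys :dump, :tar, :compress and :upload.
def run_backup(queue, steps)
  log = []
  begin
    queue.pause!
    log << :paused
    steps.fetch(:dump).call   # the database dump runs while jobs are paused
    log << :dumped
  ensure
    queue.unpause!            # runs even if the dump raised
    log << :unpaused
  end
  # The slow archive and upload steps happen with the queue running again.
  %i[tar compress upload].each do |name|
    steps.fetch(name).call
    log << name
  end
  log
end
```

With this shape, a failed dump still propagates its error to the caller, but the queue is never left paused, and the pause window covers only the dump rather than the whole backup.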

It is very possible there are some flow bugs we need to iron out. We are seeing issues in our hosting as well, and I will review this next week.

I just committed an awesome fix by @tgxworld for multisite that makes this more reliable. @gerhard also reduced the sidekiq pause window!

We have some additional work left here (never allow you to pause for longer than N hours) so assigning this to @tgxworld till it is implemented.
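One way the "never pause for longer than N hours" idea could work is a pause flag that records when it was set and expires on its own. The sketch below is an assumption about the approach, not what Discourse implements; `ExpiringPause` and its `max_seconds` parameter are hypothetical names.

```ruby
# Hypothetical pause flag that clears itself once max_seconds have elapsed.
class ExpiringPause
  def initialize(max_seconds)
    @max_seconds = max_seconds
    @paused_at = nil
  end

  # `now` is injectable so the behaviour can be checked without sleeping.
  def pause!(now = Time.now)
    @paused_at = now
  end

  def unpause!
    @paused_at = nil
  end

  def paused?(now = Time.now)
    return false if @paused_at.nil?

    if now - @paused_at >= @max_seconds
      @paused_at = nil  # the pause outlived its limit, so drop it
      return false
    end
    true
  end
end
```

With a limit like this, a backup that hangs or crashes without unpausing would delay jobs by at most the configured window instead of indefinitely.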

A change was made to the backup process which makes it extra sure that the sidekiq process will be unpaused.

https://github.com/discourse/discourse/blob/46e62c0d22099bbdb86d98c9e78fa5687e7777f5/lib/backup_restore/backuper.rb#L35-L41

Closing this for now but feel free to flag this post if it happens again.
