Backup restore failed, sidekiq won't stop then runsaway

I did a rebuild this morning and then tried restoring a backup into the container. I’m on 2.6.0.beta5
( 75a893fd61 ), with everything in the container.

The backup restore would normally work (i.e. has done previously), but today failed like this:

Starting restore: app-2020-11-06-033740-v20201009190955.tar.gz
[STARTED]
'system' has started the restore!
Marking restore as running...
Making sure /var/www/discourse/tmp/restores/default/2020-11-06-084354 exists...
Copying archive to tmp directory...
Unzipping archive, this may take a while...
Extracting dump file...
Validating metadata...
  Current version: 20201103103401
  Restored version: 20201009190955
Enabling readonly mode...
Pausing sidekiq...
Waiting up to 60 seconds for Sidekiq to finish running jobs...
Waiting for sidekiq to finish running jobs... #2
Waiting for sidekiq to finish running jobs... #3
Waiting for sidekiq to finish running jobs... #4
Waiting for sidekiq to finish running jobs... #5
Waiting for sidekiq to finish running jobs... #6
Waiting for sidekiq to finish running jobs... #7
Waiting for sidekiq to finish running jobs... #8
Waiting for sidekiq to finish running jobs... #9
Waiting for sidekiq to finish running jobs... #10
EXCEPTION: Sidekiq did not finish running all the jobs in the allowed time!
/var/www/discourse/lib/backup_restore/system_interface.rb:89:in `block in wait_for_sidekiq'
/var/www/discourse/lib/backup_restore/system_interface.rb:84:in `loop'
/var/www/discourse/lib/backup_restore/system_interface.rb:84:in `wait_for_sidekiq'
/var/www/discourse/lib/backup_restore/restorer.rb:47:in `run'
script/discourse:143:in `restore'
/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/thor-1.0.1/lib/thor/command.rb:27:in `run'
/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/thor-1.0.1/lib/thor/invocation.rb:127:in `invoke_command'
/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/thor-1.0.1/lib/thor.rb:392:in `dispatch'
/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/thor-1.0.1/lib/thor/base.rb:485:in `start'
script/discourse:284:in `<top (required)>'
/usr/local/lib/ruby/gems/2.6.0/gems/bundler-2.1.4/lib/bundler/cli/exec.rb:63:in `load'
/usr/local/lib/ruby/gems/2.6.0/gems/bundler-2.1.4/lib/bundler/cli/exec.rb:63:in `kernel_load'
/usr/local/lib/ruby/gems/2.6.0/gems/bundler-2.1.4/lib/bundler/cli/exec.rb:28:in `run'
/usr/local/lib/ruby/gems/2.6.0/gems/bundler-2.1.4/lib/bundler/cli.rb:476:in `exec'
/usr/local/lib/ruby/gems/2.6.0/gems/bundler-2.1.4/lib/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
/usr/local/lib/ruby/gems/2.6.0/gems/bundler-2.1.4/lib/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
/usr/local/lib/ruby/gems/2.6.0/gems/bundler-2.1.4/lib/bundler/vendor/thor/lib/thor.rb:399:in `dispatch'
/usr/local/lib/ruby/gems/2.6.0/gems/bundler-2.1.4/lib/bundler/cli.rb:30:in `dispatch'
/usr/local/lib/ruby/gems/2.6.0/gems/bundler-2.1.4/lib/bundler/vendor/thor/lib/thor/base.rb:476:in `start'
/usr/local/lib/ruby/gems/2.6.0/gems/bundler-2.1.4/lib/bundler/cli.rb:24:in `start'
/usr/local/lib/ruby/gems/2.6.0/gems/bundler-2.1.4/exe/bundle:46:in `block in <top (required)>'
/usr/local/lib/ruby/gems/2.6.0/gems/bundler-2.1.4/lib/bundler/friendly_errors.rb:123:in `with_friendly_errors'
/usr/local/lib/ruby/gems/2.6.0/gems/bundler-2.1.4/exe/bundle:34:in `<top (required)>'
/usr/local/bin/bundle:23:in `load'
/usr/local/bin/bundle:23:in `<main>'
Trying to rollback...
There was no need to rollback
Cleaning stuff up...
Removing tmp '/var/www/discourse/tmp/restores/default/2020-11-06-084354' directory...
Unpausing sidekiq...
Disabling readonly mode...
Marking restore as finished...
Notifying 'system' of the end of the restore...
Finished!
[FAILED]
Restore done.

After that I have a ruby processes using 100% CPU for hours. This process is described as:

# ps aux | grep sidekiq
discour+    141  100  5.0 9302596 401484 ?      SNl  06:34 127:46 sidekiq 6.1.2 discourse [5 of 5 busy]

If I stop and restart the container, this sidekiq comes back to 100%. sidekiq.log is empty, production.log doesn’t show me much.

How can I find out what this sidekiq is doing? Has anyone else found backup’s restore find with this version?

3 Likes

A kind gent pointed me at the sidekiq console and it looks like siqkiq is busy with thumbnail generation:

The backups I use don’t include thumbnails, so this might be expected except I’d already ran rake posts:rebake and then rake posts:rebake_uncooked_posts after the previous test backup restore so thought that would have done all the thumbnail generation (the rebake took about about 1 hour for something like 100,000 posts, the rebake_uncooked did nothing as I guess the full rebake had taken care of everything).

Assuming this workload is the thing that prevents the backup succeeding, shouldn’t restoring a backup first purge the the task queues?

Also why is sidekiq generating thumbnails when there’s apparently nothing to do, or do the rebake commands just queue the work?

# rake posts:rebake_uncooked_posts
Rebaking uncooked posts on default

0 posts done!
3 Likes

That’s right. The rebake queues up a bunch of stuff rather than actually doing it all.

Probably. But it’s mostly assumed that if you know enough to restore a backup then you know enough to clear the sidekiq queue first; usually you’d restore to a site that was empty.

I belive that’s the case.

2 Likes

Thanks, this is starting to make a lot more sense.

So would the correct sequence be to first destroy, then (re)build the container, then enter it and restore the backup?

If that’s the case, great, though perhaps Discourse could provide that hint/warning when starting a restore into a non-empty container.

That’s the nuclear option (and you’d really only need to delete the redis data, I guess and then rebuild). Usually, you don’t have a zillion jobs running and it’s just not an issue. But since you’re apparently doing two restores in a row and the first one hasn’t totally finished when you start again, you’re s special case.

The other thing to do would be to go to /sidekiq and delete all of those queued jobs. It takes just a few clicks.

3 Likes

A good test for disaster recovery then :smiley:

You are right that it probably is a bit of a special case. I’ve been trying to get the rebake time down, so restoring backups and timing it in different circumstances (restoring without thumbnails takes 2 to 3 days to complete rebaking by default!). This is all groundwork for a site migration.

I think I understand a lot more now, thank you. But for something as critical as backups, I think it would be sensible for Discourse to ensure it is as reliable as possible (e.g. purge sidekiq queues prior to starting), or issue clear warnings that other steps are needed if it detects potential problems for restoration.

Only one you know to do that.

I suggest you create a backup with thumbnails (include_thumbnails_in_backups site setting) when you want to migrate to a different server.

3 Likes

Thanks! This is an option I’ve tried too, but the backup, while only being about and extra 6GB seems to take around and extra hour to create. It looks to be the tar step taking the time - there maybe something on my older hardware where lots of small files are slow to access. Gzip level is set to 1, so I think it is IO limited rather then CPU. The new host can do the same backup in 30 - 40 minutes.

Possibly I’m just being a bit impatient with this though, and I should accept the backup taking longer, though with the primary site in read-only mode during this process I am keen to minimise that time.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.