Implosion after 2.7beta7 upgrade

Hi all,

We have run a self-hosted Discourse instance at https://discourse.bokeh.org for a number of years. Generally speaking, it has been rock-solid and almost no effort to maintain. In particular, performing updates is usually a complete non-event that finishes without any issues.

However, today, after an update to 2.7beta7 (that seemed to complete without issue), our site has completely imploded. It limped along for a bit with pages mis-rendered and JS console errors, but after attempting a rollback, the UI became non-functional. After logging in to the droplet, I have also tried the following, to no avail:

Rebuild

./launcher rebuild app

This has failed in several ways over several tries.

Discourse Doctor

./discourse-doctor

Restore

./launcher enter app
discourse restore <backup file>

This failed.

Wipe

I also tried doing a “wipe” and then restoring:

./launcher stop app
./launcher destroy app
rm -r /var/discourse/shared/standalone/

After this I was at least able to get a rebuild to succeed, which led to a “fresh install” state, e.g. “Congratulations, you installed Discourse!”

So now I have tried running discourse restore again, but this has failed again:

EXCEPTION: 1 posts are not remapped to new S3 upload URL. S3 migration failed for db 'default'.
/var/www/discourse/lib/file_store/to_s3_migration.rb:131:in `raise_or_log'
/var/www/discourse/lib/file_store/to_s3_migration.rb:86:in `migration_successful?'
/var/www/discourse/lib/file_store/to_s3_migration.rb:357:in `migrate_to_s3'
/var/www/discourse/lib/file_store/to_s3_migration.rb:65:in `migrate'
/var/www/discourse/lib/file_store/s3_store.rb:240:in `copy_from'
/var/www/discourse/lib/backup_restore/uploads_restorer.rb:62:in `restore_uploads'
/var/www/discourse/lib/backup_restore/uploads_restorer.rb:44:in `restore'
/var/www/discourse/lib/backup_restore/restorer.rb:62:in `run'
script/discourse:145:in `restore'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/thor-1.1.0/lib/thor/command.rb:27:in `run'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/thor-1.1.0/lib/thor/invocation.rb:127:in `invoke_command'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/thor-1.1.0/lib/thor.rb:392:in `dispatch'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/thor-1.1.0/lib/thor/base.rb:485:in `start'
script/discourse:286:in `'
/usr/local/lib/ruby/gems/2.7.0/gems/bundler-2.2.7/lib/bundler/cli/exec.rb:63:in `load'
/usr/local/lib/ruby/gems/2.7.0/gems/bundler-2.2.7/lib/bundler/cli/exec.rb:63:in `kernel_load'
/usr/local/lib/ruby/gems/2.7.0/gems/bundler-2.2.7/lib/bundler/cli/exec.rb:28:in `run'
/usr/local/lib/ruby/gems/2.7.0/gems/bundler-2.2.7/lib/bundler/cli.rb:494:in `exec'
/usr/local/lib/ruby/gems/2.7.0/gems/bundler-2.2.7/lib/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
/usr/local/lib/ruby/gems/2.7.0/gems/bundler-2.2.7/lib/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
/usr/local/lib/ruby/gems/2.7.0/gems/bundler-2.2.7/lib/bundler/vendor/thor/lib/thor.rb:392:in `dispatch'
/usr/local/lib/ruby/gems/2.7.0/gems/bundler-2.2.7/lib/bundler/cli.rb:30:in `dispatch'
/usr/local/lib/ruby/gems/2.7.0/gems/bundler-2.2.7/lib/bundler/vendor/thor/lib/thor/base.rb:485:in `start'
/usr/local/lib/ruby/gems/2.7.0/gems/bundler-2.2.7/lib/bundler/cli.rb:24:in `start'
/usr/local/lib/ruby/gems/2.7.0/gems/bundler-2.2.7/exe/bundle:49:in `block in '
/usr/local/lib/ruby/gems/2.7.0/gems/bundler-2.2.7/lib/bundler/friendly_errors.rb:130:in `with_friendly_errors'
/usr/local/lib/ruby/gems/2.7.0/gems/bundler-2.2.7/exe/bundle:37:in `'
/usr/local/bin/bundle:23:in `load'
/usr/local/bin/bundle:23:in `'
Trying to rollback...
Rolling back...
Cleaning stuff up...
Dropping functions from the discourse_functions schema...
Removing tmp '/var/www/discourse/tmp/restores/default/2021-04-23-235404' directory...
Marking restore as finished...
Notifying 'system' of the end of the restore...
Finished!
[FAILED]

What’s odd is that during the restore, the site seemed to be getting back to normal, with old content showing up. Then the failure happened and now nothing shows up, accounts are gone, etc.

I could really use any guidance or suggestions here. We have daily backups going back a week (further in Glacier if need be). We deleted an unused Category a few days ago; could that be the cause of the problems? I will try an older backup to see, but any pointers to an iron-clad restore process would be welcome.

2 Likes

An older backup did not help. During the restore, things look “fine-ish”

Right up until the end, then immediately go to this:

As an aside, does the backup not contain posts that were imported originally from a mailing list, but have been on the Discourse site for years?? We had 24k posts, not 5k posts prior to this.

2 Likes

Is there a way to rebuild the app at an older version of Discourse, e.g. at 2.7beta6?

2 Likes

Alternatively, I guess there is literally just a single post causing some problem:

EXCEPTION: 1 posts are not remapped to new S3 upload URL. S3 migration failed for db ‘default’.

Is it possible to find this post and just delete or remove it?

2 Likes

Do you have images on s3?

2 Likes

@pfaffman Yes we do have images on s3

2 Likes

Hmm. Well maybe there is just one broken post? I guess you’d need to restore the database, fix it, then rebuild the backup file with the new database. But it’s really hard to tell.

2 Likes

Are there instructions somewhere for how to do that? How can I determine what the one broken post is?

Edit: alternatively is there someone with expertise that can be contracted for services?

2 Likes

Do you have a fresh droplet backup or snapshot that you can restore?

2 Likes

Do you have a fresh droplet backup or snapshot that you can restore?

Alas, no. DO backups are only weekly, and I had set up daily Discourse backups to S3, keeping a week’s worth fresh and a year back in Glacier. Given my previous positive experience with many Discourse upgrades, I had thought that would be both sufficient and better (and I had done successful test restores before).

2 Likes

let’s say
version: 94301854938a0b36dd64666fb7a7c8406544a781 which is the commit just before the beta bump
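
A rough sketch of how such a pin could be applied, assuming a standard install where the container definition is /var/discourse/containers/app.yml and contains a version: line (possibly commented out) under params:. The helper name pin_version is hypothetical, and whether an older commit runs cleanly against a database that the newer version has already migrated is not guaranteed.

```shell
#!/bin/sh
# Hypothetical helper: point the "version:" line of a container definition
# at a given Git revision. Keeps a copy of the original as <file>.bak.
pin_version() {
  yml="$1"   # path to the container definition (assumed: app.yml)
  rev="$2"   # Git revision to pin, e.g. a full commit hash
  cp "$yml" "$yml.bak" || return 1
  # Match an indented "version:" or "#version:" line, keep the indentation,
  # drop the comment marker, and replace the value.
  sed -E "s|^([[:space:]]*)#?[[:space:]]*version:.*|\1version: $rev|" \
    "$yml.bak" > "$yml"
}

# Assumed usage on a standard install, followed by a rebuild:
#   cd /var/discourse
#   pin_version containers/app.yml 94301854938a0b36dd64666fb7a7c8406544a781
#   ./launcher rebuild app
```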

3 Likes

Well, an update, actually. In a fit of desperation I simply hard-aborted the restore script during the

Syncing files to S3

step, before it complained about the 1 bad post and started a rollback.

The site actually came back up and was accessible in safe-mode. I disabled the “COPY PASTE” theme component for code blocks that is apparently now wildly incompatible with the latest beta. After that, the site seems to actually be mostly in working order even without safe mode. But:

  • Are there any suggested actions to make sure things are as “cleaned up” as possible? E.g. re-upload assets to S3 and “rebake”? Where are the best, most current instructions for that?

EDIT: as far as I can tell, everything is back to normal. Images in old posts are loading correctly from S3/CDN. I guess then my question is, if we are uploading images to S3, should this option be unchecked?

Include uploads in scheduled backups. Disabling this will only backup the database.

I guess I thought it offered an extra layer of redundancy to have it checked, but it seems like it is the source of all these issues during restore?

2 Likes

The last time I moved to another hosting provider, I also had issues with S3 during the restore. I asked about it, and the answer is yes: when you store your images in S3, you have to make the backup without uploads, database only. I don’t know whether that is the solution in your case, though. If you try it, create a snapshot first.

You can see more here: Restore a backup from command line - #28 by itsbhanusharma

I just tried it and it worked great, without any problems. :slightly_smiling_face:

2 Likes

I can help. On Monday, or sooner for hazard pay.

You can reach us via Contact — Literate Computing, LLC or the email address at the bottom of the page.

2 Likes

@pfaffman Thank you! The site seems to now be functioning completely normally, but I would not turn down a quick look-over / sanity check at your convenience, if you are willing. I’ve noted your commercial contact info for any future emergencies regardless. :slight_smile:

Here is a retro, in case it is useful for anyone else.


SUMMARY

After a (successful) upgrade, an incompatible unofficial theme component left the site UI broken. The cause was not known at the time, so a restore from backup was initiated, but it ran into problems due to a backup configuration that was less than ideal.

DETAILS

  1. An upgrade to 2.7beta7 was initiated, and completed successfully.

  2. However, after the upgrade, the site UI was severely compromised: post bodies were entirely missing, top navigation (including user / login) was entirely missing, and the JS console reported errors.

    1. The reason for this turned out to be an incompatible third-party theme component for copy-pasting codeblocks, but this was not known at the time.

    2. Also not known at the time was the possibility of entering safe-mode. Had that been known, the remaining problems could have been avoided.

  3. Access was gained to /admin by direct navigation, and a rollback was attempted. This immediately logged out the user, and there did not appear to be any way to log back in with the broken UI.

  4. Logged in to DO Droplet to initiate a manual restore with the latest backup tarball from S3.

  5. Many restore attempts failed near the final step, after uploading S3 assets, due to a single post having some error.

    1. This is evidently because restores that try to re-upload assets to S3 can be flaky (several reports of this on Meta).

    2. However, we did not need to be backing up upload assets, since they are already stored on S3!

  6. During one of the many restore attempts, the site was also wiped in order to try from a “clean slate”, so the rollbacks after the restore failures now rolled back to an empty site.

  7. Eventually, in a desperate gamble, I ran the restore, and hard-aborted the script during the S3 upload (right before the failure and rollback)

  8. The site came back online, but exhibited the previous UI problems. However, safe-mode was now known and used, and the site functioned normally with plugins and themes disabled.

  9. All unofficial plugins and themes were removed (including the “copy-paste” component), after which the site functioned normally.

  10. Verified that previously uploaded images were still loading, and from the S3 CDN

I suspect the final upload and “rebake” was not needed since the assets were already still on S3, and the posts did not need to be updated to use new URLs.

I am not certain what, if any, remaining restore steps were missed after the script was aborted, but so far no one has encountered any issues with the site in its current state.

LESSONS / ACTIONS

  • Start with safe-mode for diagnosing problems in the future
  • Turn off the setting to include uploads in backups (will suggest to Discourse to warn about this situation)
  • Remove all unofficial plugins and theme components
  • Turn on the new built-in codeblock copy feature
  • Enable Digital Ocean weekly image backups as a backstop for recovery

6 Likes

Theme Component or Plugin? Could you let the author know?

3 Likes

@merefield It seems to be known, which is how I came to learn of the possibility

4 Likes

If I were to offer one suggestion re: the restore process, it would be to offer some options to make it more resilient. E.g. if all that is standing in the way of a successful restore is one bad post, I would mash the Y key in a heartbeat if asked “Delete the bad post?”

3 Likes

That will be in one of the next 2.8 betas. Unfortunately it isn’t ready for the upcoming 2.7 release yet.

I’m sorry I didn’t see your cry for help earlier. Here’s a tip for everyone else who struggles with uploads stored on S3 messing up a restore: Extract the dump.sql.gz file from your backup and rename the file. E.g. when the original backup was discourse-2020-10-09-133921-v20201007124955.tar.gz then the resulting file should be called discourse-2020-10-09-133921-v20201007124955.sql.gz. Restoring that file should work.
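
A minimal sketch of that extraction step, assuming GNU or BSD tar and a backup archive that contains a member named dump.sql.gz, as described above. The helper name extract_db_dump is hypothetical. It writes the database dump next to the original archive under the same base name with a .sql.gz extension and leaves the original .tar.gz untouched.

```shell
#!/bin/sh
# Hypothetical helper: pull the database dump out of a Discourse backup
# tarball and save it as <backup base name>.sql.gz.
extract_db_dump() {
  backup="$1"                       # e.g. discourse-...-v20201007124955.tar.gz
  out="${backup%.tar.gz}.sql.gz"    # same base name, .sql.gz extension
  # Locate the dump inside the archive (with or without a leading "./").
  member=$(tar -tzf "$backup" | grep -E '(^|/)dump\.sql\.gz$' | head -n 1)
  if [ -z "$member" ]; then
    echo "no dump.sql.gz found in $backup" >&2
    return 1
  fi
  # -O streams the member to stdout, so nothing else gets unpacked.
  tar -xzOf "$backup" "$member" > "$out" && printf '%s\n' "$out"
}

# Assumed usage, run in the directory that holds the backup files:
#   extract_db_dump discourse-2020-10-09-133921-v20201007124955.tar.gz
# then, inside the container (./launcher enter app):
#   discourse restore discourse-2020-10-09-133921-v20201007124955.sql.gz
```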

6 Likes