Help restoring - system hung at midnight

Okay, this was bound to happen; things were just too good for too long. Years of running on cruise control: the system would automatically update itself and I would update Discourse every few weeks. At midnight last night, Amazon showed the system was unresponsive, Discourse was down and the CPU was pegged at 100% until it ran out of AWS CPU resources. I couldn’t log in to the system. The one time I was able to log in momentarily, after several reboots, I saw this in htop taking up a lot of CPU:

snap lxd activate

If anyone has seen this and can throw some light on why this may have happened by itself, it would be much appreciated for future reference.

Coming to the pressing issue at hand, I proceeded to build a new server on AWS using Ubuntu 20 LTS; Discourse setup was exceedingly easy. I had a copy of the app.yml file, which I used to recreate the Discourse forum. The old server was using S3 for the backups AND for the content (images etc).

After creating the server, I downloaded the latest Discourse backup file from S3, manually uploaded it to the Discourse server and hit the Restore button. After a few minutes I got this error:

[2022-06-09 09:01:56] ALTER TABLE
[2022-06-09 09:01:56] ALTER TABLE
[2022-06-09 09:01:56] Migrating the database...
[2022-06-09 09:02:11] == 20220308201942 CreateUploadReferences: migrating ===========================
-- create_table(:upload_references, {})
   -> 0.0486s
-- add_index(:upload_references, [:upload_id, :target_type, :target_id], {:unique=>true, :name=>"index_upload_references_on_upload_and_target"})
   -> 0.0030s
== 20220308201942 CreateUploadReferences: migrated (0.0580s) ==================

== 20220309132719 CopyPostUploadsToUploadReferences: migrating ================
-- execute("INSERT INTO upload_references(upload_id, target_type, target_id, created_at, updated_at)\nSELECT post_uploads.upload_id, 'Post', post_uploads.post_id, uploads.created_at, uploads.updated_at\nFROM post_uploads\nJOIN uploads ON uploads.id = post_uploads.upload_id\nON CONFLICT DO NOTHING\n")
   -> 0.0595s
== 20220309132719 CopyPostUploadsToUploadReferences: migrated (0.0602s) =======

== 20220309132720 CopyPostUploadsToUploadReferencesForSync: migrating =========
-- execute("INSERT INTO upload_references(upload_id, target_type, target_id, created_at, updated_at)\nSELECT upload_id, 'Post', post_id, NOW(), NOW()\nFROM post_uploads\nON CONFLICT DO NOTHING\n")
   -> 0.0076s
== 20220309132720 CopyPostUploadsToUploadReferencesForSync: migrated (0.0080s) 

== 20220330160747 CopySiteSettingsUploadsToUploadReferences: migrating ========
-- execute("WITH site_settings_uploads AS (\n  SELECT id, unnest(string_to_array(value, '|'))::integer upload_id\n  FROM site_settings\n  WHERE data_type = 17\n  UNION\n  SELECT id, value::integer\n  FROM site_settings\n  WHERE data_type = 18 AND value != ''\n)\nINSERT INTO upload_references(upload_id, target_type, target_id, created_at, updated_at)\nSELECT site_settings_uploads.upload_id, 'SiteSetting', site_settings_uploads.id, uploads.created_at, uploads.updated_at\nFROM site_settings_uploads\nJOIN uploads ON uploads.id = site_settings_uploads.upload_id\nON CONFLICT DO NOTHING\n")
   -> 0.0034s
== 20220330160747 CopySiteSettingsUploadsToUploadReferences: migrated (0.0038s) 

== 20220330160751 CopyBadgesUploadsToUploadReferences: migrating ==============
-- execute("INSERT INTO upload_references(upload_id, target_type, target_id, created_at, updated_at)\nSELECT badges.image_upload_id, 'Badge', badges.id, uploads.created_at, uploads.updated_at\nFROM badges\nJOIN uploads ON uploads.id = badges.image_upload_id\nWHERE badges.image_upload_id IS NOT NULL\nON CONFLICT DO NOTHING\n")
   -> 0.0006s
== 20220330160751 CopyBadgesUploadsToUploadReferences: migrated (0.0010s) =====

== 20220330160754 CopyGroupsUploadsToUploadReferences: migrating ==============
-- execute("INSERT INTO upload_references(upload_id, target_type, target_id, created_at, updated_at)\nSELECT groups.flair_upload_id, 'Group', groups.id, uploads.created_at, uploads.updated_at\nFROM groups\nJOIN uploads ON uploads.id = groups.flair_upload_id\nWHERE groups.flair_upload_id IS NOT NULL\nON CONFLICT DO NOTHING\n")
   -> 0.0050s
== 20220330160754 CopyGroupsUploadsToUploadReferences: migrated (0.0055s) =====

== 20220330160757 CopyUserExportsUploadsToUploadReferences: migrating =========
-- execute("INSERT INTO upload_references(upload_id, target_type, target_id, created_at, updated_at)\nSELECT user_exports.upload_id, 'UserExport', user_exports.id, uploads.created_at, uploads.updated_at\nFROM user_exports\nJOIN uploads ON uploads.id = user_exports.upload_id\nON CONFLICT DO NOTHING\n")
   -> 0.0013s
== 20220330160757 CopyUserExportsUploadsToUploadReferences: migrated (0.0041s) 

== 20220330164740 CopyThemeFieldsUploadsToUploadReferences: migrating =========
-- execute("INSERT INTO upload_references(upload_id, target_type, target_id, created_at, updated_at)\nSELECT theme_fields.upload_id, 'ThemeField', theme_fields.id, uploads.created_at, uploads.updated_at\nFROM theme_fields\nJOIN uploads ON uploads.id = theme_fields.upload_id\nWHERE type_id = 2\nON CONFLICT DO NOTHING\n")
   -> 0.0006s
== 20220330164740 CopyThemeFieldsUploadsToUploadReferences: migrated (0.0010s) 

== 20220404195635 CopyCategoriesUploadsToUploadReferences: migrating ==========
-- execute("INSERT INTO upload_references(upload_id, target_type, target_id, created_at, updated_at)\nSELECT categories.uploaded_logo_id, 'Category', categories.id, uploads.created_at, uploads.updated_at\nFROM categories\nJOIN uploads ON uploads.id = categories.uploaded_logo_id\nWHERE categories.uploaded_logo_id IS NOT NULL\nON CONFLICT DO NOTHING\n")
   -> 0.0095s
-- execute("INSERT INTO upload_references(upload_id, target_type, target_id, created_at, updated_at)\nSELECT categories.uploaded_background_id, 'Category', categories.id, uploads.created_at, uploads.updated_at\nFROM categories\nJOIN uploads ON uploads.id = categories.uploaded_background_id\nWHERE categories.uploaded_background_id IS NOT NULL\nON CONFLICT DO NOTHING\n")
   -> 0.0004s
== 20220404195635 CopyCategoriesUploadsToUploadReferences: migrated (0.0103s) =

== 20220404201949 CopyCustomEmojisUploadsToUploadReferences: migrating ========
-- execute("INSERT INTO upload_references(upload_id, target_type, target_id, created_at, updated_at)\nSELECT custom_emojis.upload_id, 'CustomEmoji', custom_emojis.id, uploads.created_at, uploads.updated_at\nFROM custom_emojis\nJOIN uploads ON uploads.id = custom_emojis.upload_id\nWHERE custom_emojis.upload_id IS NOT NULL\nON CONFLICT DO NOTHING\n")
   -> 0.0032s
== 20220404201949 CopyCustomEmojisUploadsToUploadReferences: migrated (0.0036s) 

== 20220404203356 CopyUserProfilesUploadsToUploadReferences: migrating ========
-- execute("INSERT INTO upload_references(upload_id, target_type, target_id, created_at, updated_at)\nSELECT user_profiles.profile_background_upload_id, 'UserProfile', user_profiles.user_id, uploads.created_at, uploads.updated_at\nFROM user_profiles\nJOIN uploads ON uploads.id = user_profiles.profile_background_upload_id\nWHERE user_profiles.profile_background_upload_id IS NOT NULL\nON CONFLICT DO NOTHING\n")
   -> 0.0017s
-- execute("INSERT INTO upload_references(upload_id, target_type, target_id, created_at, updated_at)\nSELECT user_profiles.card_background_upload_id, 'UserProfile', user_profiles.user_id, uploads.created_at, uploads.updated_at\nFROM user_profiles\nJOIN uploads ON uploads.id = user_profiles.card_background_upload_id\nWHERE user_profiles.card_background_upload_id IS NOT NULL\nON CONFLICT DO NOTHING\n")
   -> 0.0011s
== 20220404203356 CopyUserProfilesUploadsToUploadReferences: migrated (0.0033s) 

== 20220404204439 CopyUserAvatarsUploadsToUploadReferences: migrating =========
-- execute("INSERT INTO upload_references(upload_id, target_type, target_id, created_at, updated_at)\nSELECT user_avatars.custom_upload_id, 'UserAvatar', user_avatars.id, uploads.created_at, uploads.updated_at\nFROM user_avatars\nJOIN uploads ON uploads.id = user_avatars.custom_upload_id\nWHERE user_avatars.custom_upload_id IS NOT NULL\nON CONFLICT DO NOTHING\n")
   -> 0.0200s
-- execute("INSERT INTO upload_references(upload_id, target_type, target_id, created_at, updated_at)\nSELECT user_avatars.gravatar_upload_id, 'UserAvatar', user_avatars.id, uploads.created_at, uploads.updated_at\nFROM user_avatars\nJOIN uploads ON uploads.id = user_avatars.gravatar_upload_id\nWHERE user_avatars.gravatar_upload_id IS NOT NULL\nON CONFLICT DO NOTHING\n")
   -> 0.0069s
== 20220404204439 CopyUserAvatarsUploadsToUploadReferences: migrated (0.0276s) 

== 20220404212716 CopyThemeSettingsUploadsToUploadReferences: migrating =======
-- execute("INSERT INTO upload_references(upload_id, target_type, target_id, created_at, updated_at)\nSELECT theme_settings.value::int, 'ThemeSetting', theme_settings.id, uploads.created_at, uploads.updated_at\nFROM theme_settings\nJOIN uploads ON uploads.id = theme_settings.value::int\nWHERE data_type = 6 AND theme_settings.value IS NOT NULL AND theme_settings.value != ''\nON CONFLICT DO NOTHING\n")
   -> 0.0025s
== 20220404212716 CopyThemeSettingsUploadsToUploadReferences: migrated (0.0030s) 

== 20220526203356 CopyUserUploadsToUploadReferences: migrating ================
-- execute("INSERT INTO upload_references(upload_id, target_type, target_id, created_at, updated_at)\nSELECT users.uploaded_avatar_id, 'User', users.id, uploads.created_at, uploads.updated_at\nFROM users\nJOIN uploads ON uploads.id = users.uploaded_avatar_id\nWHERE users.uploaded_avatar_id IS NOT NULL\nON CONFLICT DO NOTHING\n")
   -> 0.0227s
== 20220526203356 CopyUserUploadsToUploadReferences: migrated (0.0234s) =======


[2022-06-09 09:02:11] Reconnecting to the database...
[2022-06-09 09:02:12] Reloading site settings...
[2022-06-09 09:02:12] Disabling outgoing emails for non-staff users...
[2022-06-09 09:02:14] Disabling readonly mode...
[2022-06-09 09:02:14] Clearing category cache...
[2022-06-09 09:02:14] Reloading translations...
[2022-06-09 09:02:14] Remapping uploads...
[2022-06-09 09:02:14] Restoring uploads, this may take a while...
[2022-06-09 09:03:05] EXCEPTION: 509 of 1823 uploads are not migrated to S3. S3 migration failed for db 'default'.
[2022-06-09 09:03:05] /var/www/discourse/lib/file_store/to_s3_migration.rb:132:in `raise_or_log'
/var/www/discourse/lib/file_store/to_s3_migration.rb:79:in `migration_successful?'
/var/www/discourse/lib/file_store/to_s3_migration.rb:373:in `migrate_to_s3'
/var/www/discourse/lib/file_store/to_s3_migration.rb:66:in `migrate'
/var/www/discourse/lib/file_store/s3_store.rb:328:in `copy_from'
/var/www/discourse/lib/backup_restore/uploads_restorer.rb:62:in `restore_uploads'
/var/www/discourse/lib/backup_restore/uploads_restorer.rb:44:in `restore'
/var/www/discourse/lib/backup_restore/restorer.rb:61:in `run'
/var/www/discourse/script/spawn_backup_restore.rb:23:in `restore'
/var/www/discourse/script/spawn_backup_restore.rb:36:in `block in <main>'
/var/www/discourse/script/spawn_backup_restore.rb:4:in `fork'
/var/www/discourse/script/spawn_backup_restore.rb:4:in `<main>'

Can anyone advise on what’s the problem and how I can restore the server from the Amazon S3 server backups?

Hi,
after recreating the new server from the app.yml, did you have access to the backups in the https://your.domain/admin/backups section?

No, after recreating from app.yml it just gave me a clean new setup with nothing. I downloaded the last backup from S3 and manually uploaded it to discourse locally and hit restore.

It starts restoring, and I see all the settings (including the S3 settings, the credentials, everything from the Settings page) come back; I see all the categories and posts show up. Then suddenly, after a few minutes, I get a log out message, all the categories and topics disappear, and the log shows that error (it appears to roll back).

:thinking: so the S3 config isn’t in the app.yml (as described in Using Object Storage for Uploads (S3 & Clones))? But configured as in Setting up file and image uploads to S3?

No, I don’t see this in my app.yml.

All the S3 settings were defined in the Admin → Settings pages and it was working fine for years until I needed to restore it when the server went down last night.

Correct, this is what I used to set up the S3 backups and uploads.

I think I would try editing the app.yml with your settings and from there I believe (hope?) you should see your backups in the admin section and restore from there, without the manual import and the uploads, which you shouldn’t have to restore, but seem to be included in the backups. I don’t know why it is failing though…

I think this may be because your backup contains a mix of S3 and local uploads. I’m afraid it’s not my area of expertise, but there is some discussion and a workaround in this topic which allows you to bypass the failure. However, it was for a much smaller number of errors so you may want to take that into account:

Thanks. My site has crashed, unfortunately, so I’m only left with the S3 uploads and backups. I’m assuming there’s no way for me to migrate any leftover local files to S3 now.

So I’m wondering what my options are right now. Is there a way to restore from S3 backups and ignore the local files? I found a way to have it ignore the S3 uploads, but then pretty much all the posts have broken links/images (90%+ is probably in S3, because I set up the S3 uploads many, many years ago).

So an update for folks who may be struggling with the same issue (basically I’m unable to restore from a backup and the server crashed due to a faulty system upgrade).

From what I understand the root cause of the issue is that there are local uploads AND there are S3 uploads, so when the restore tool is trying to restore it’s bugging out because it doesn’t know how to handle local and S3 restores at the same time (maybe it’s time for Discourse to relook at backup/restores).

Thanks to @RGJ for this tip; he suggested forcing Discourse to ignore the S3 uploads while restoring:

  1. Add a line to your app.yml DISCOURSE_ENABLE_S3_UPLOADS=false
  2. Rebuild discourse ./launcher rebuild app
  3. Attempt a restore (either from the GUI Backup page or using the CLI)
  4. Then after restoring, remove that line from app.yml and rebuild one more time
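
A minimal sketch of a pre-flight check for step 1, so that a rebuild is not started with the override missing or misspelled. It assumes the standard install location (/var/discourse/containers/app.yml) and assumes the override sits under the env: section in YAML form (DISCOURSE_ENABLE_S3_UPLOADS: false) rather than the KEY=value form quoted above. The function name s3_uploads_disabled is made up for illustration.

```shell
# Hypothetical helper: succeeds only if the given app.yml sets
# DISCOURSE_ENABLE_S3_UPLOADS to false (YAML "KEY: value" form, assumed to be under env:).
# Commented-out lines do not count, because the key must be the first thing on the line.
s3_uploads_disabled() {
  grep -Eq '^[[:space:]]*DISCOURSE_ENABLE_S3_UPLOADS:[[:space:]]*"?false"?[[:space:]]*(#.*)?$' "$1"
}

# Possible use on the host (paths assume the standard /var/discourse install):
#   cd /var/discourse
#   s3_uploads_disabled containers/app.yml && ./launcher rebuild app
#   ./launcher enter app      # then, inside the container: discourse restore
```

The check only reads the file; editing app.yml and removing the line again after the restore (step 4) are still manual.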

While this worked, note that the forum was badly broken afterwards: the categories, settings and posts were restored, but all the images, links, embedded documents etc. were broken and errored out.

The hail-mary solution:
I managed to salvage the old server and extracted the /var/discourse directory (tar/gz) and copied it onto the new server and did a ./launcher rebuild app. This completely restored the operation of the forum, however the fundamental problem still remains - the backups will NOT work because they have a mix of local and S3 uploads.

So I really need some advice on the best way to fix this issue once and for all. Is it better/easier to move all the uploads from local to S3 or from S3 to local, and how does one do it? The entire point of a backup is to help out in situations like this one, but it has failed me, so I need your help to get it straightened out.

If you configure as described in Using Object Storage for Uploads (S3 & Clones) you should be able to

 rake uploads:migrate_to_s3

If you want to stop using s3, then you can enter the rails console and set

  SiteSetting.include_s3_uploads_in_backups=true

Then take a backup, make sure that you do not have s3 configured in your app.yml and restore the backup. I think that this will restore backups to local.

But in either case, I would still recommend setting your keys and backup bucket in ENV variables in your app.yml file and then check that you can restore it to a new site.

Okay I think I got a little mixed up here.

I’m thinking the ideal thing to do would probably be to have all uploads local and save the backups (zips) to the S3. This way the backup is available on S3 should anything happen to the server but the backup in itself is self contained with no dependencies so it should be easy to restore it on a new server.

So if I understood correctly, I should follow these instructions:

If you want to stop using s3, then you can enter the rails console and set

  SiteSetting.include_s3_uploads_in_backups=true
Then take a backup, make sure that you do not have s3 configured in your app.yml and restore the backup. I think that this will restore backups to local.

and then

  1. disable the enable upload to S3 option in Admin → Settings → Files
  2. enable the backup to S3 option in the Admin → Settings → Backups page

Is that correct?

This is the part that confused me: why would I want to put the S3 configuration in the app.yml file?

So that you have access to your backups via a command line restore before you restore your backup. Otherwise, you have to set up an admin account and then configure S3 and then restore. Similarly, whatever settings you put in your database get overwritten when you restore the database.

I think that best practice is to configure S3 only via ENV variables in the app.yml file. It would probably make sense to make them be hidden settings, if not for hundreds of people who would be surprised that they had disappeared.
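
As a rough illustration of checking that an app.yml really carries the S3 configuration, here is a small sketch that lists which expected variables are absent. The variable names are recalled from the Using Object Storage for Uploads (S3 & Clones) guide and should be treated as assumptions to verify against that guide; the function name missing_s3_env is made up.

```shell
# Hypothetical helper: print every expected S3-related variable that is NOT
# defined in the given app.yml. The list of names is an assumption; adjust it
# to match the object storage guide for your provider.
missing_s3_env() {
  local file="$1" var
  for var in DISCOURSE_USE_S3 DISCOURSE_S3_REGION DISCOURSE_S3_ACCESS_KEY_ID \
             DISCOURSE_S3_SECRET_ACCESS_KEY DISCOURSE_S3_BUCKET \
             DISCOURSE_S3_BACKUP_BUCKET DISCOURSE_BACKUP_LOCATION; do
    # A commented-out line does not count: the key must start the line.
    grep -Eq "^[[:space:]]*${var}:" "$file" || echo "$var"
  done
}

# Possible use on the host:
#   missing_s3_env /var/discourse/containers/app.yml
```

An empty result means every name in the list was found; it says nothing about whether the values (keys, bucket names) are correct.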

Because you will have trouble restoring otherwise.

How would one restore a backup from S3 using the command line? According to the instructions here: Restore a backup from command line
It says you can drop the backup file into the /var/discourse/shared/standalone/backups/default folder and then start a restore from the CLI. This is what I had done with your suggestion earlier (which eventually led to broken links unfortunately), but that does work.
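
For reference, the manual route described above could be sketched roughly like this. It assumes the backup archive has already been downloaded from S3 to the host, and the function name stage_backup is made up; the default destination is the folder quoted above.

```shell
# Hypothetical helper: copy a downloaded backup archive into the folder that the
# CLI restore reads from, and print the resulting path.
# The second argument overrides the destination for non-standard installs.
stage_backup() {
  local src="$1"
  local dest_dir="${2:-/var/discourse/shared/standalone/backups/default}"
  [ -f "$src" ] || { echo "no such backup file: $src" >&2; return 1; }
  mkdir -p "$dest_dir" && cp -- "$src" "$dest_dir/" && echo "$dest_dir/$(basename -- "$src")"
}

# Possible use on the host (the file name is a placeholder):
#   stage_backup ~/my-backup.tar.gz
#   cd /var/discourse && ./launcher enter app    # then run the restore from the CLI
```
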

How does one restore directly from S3 using the CLI?

cd /var/discourse
./launcher enter app
discourse restore

It’ll print the available backups; you can then copy/paste the one that you want to restore.

Thanks, so it’ll read the S3 backups and list them as options.

Jay, to follow up on a suggestion you had made to move assets local:

I think you can set a hidden setting include_s3_uploads_in_backups to true and then make a backup and restore it when s3 is turned off to stop using S3.

Having S3 backups with them configured in app.yml means that you can do a command line restore with only the app.yml file (after cloning discourse and installing docker).

For the first step, would I need to back up the S3 buckets, or is this a bucket-safe operation?

Well, at least I figured out why my server crashed last night (and again today after a complete rebuild :frowning: ); see this topic for details: Ubuntu 20.04 kernel update with docker causing a crash

So to get it up and running from a backup, I had to

  1. disable Settings → Files → enable s3 uploads
  2. set Settings → Backups → backup location to S3
  3. enable Settings → Backups → backup with uploads

Then I took a backup and I was able to restore it successfully. However, one thing did break: all the attachments (files) now have invalid links. The images are all good, but attachment links like https://domain.com/uploads/short-url/phu1HOLvkE8LWpkKYfnMPSWsvHh.zip now give me an error:

Oops! That page doesn’t exist or is private.

Is there a way to fix these short-url links?

You might try doing an HTML rebuild (aka rebake) on one of those topics to see if it fixes it.

Thanks. Is there a guide somewhere on how to issue the command to bake specific topics?