Upload remapping is a bit too enthousiastic

When a backup is restored to a database with a different name, there is a neat little piece of code in restorer.rb that attempts to fix this.

https://github.com/discourse/discourse/blob/master/lib/backup_restore/restorer.rb#L448

However, this renames ALL upload filenames, not just the ones that were extracted from the archive. If the database name occurs in an S3 path (which obviously does not change), it changes it as well, causing all images to fail loading.

3 Likes

What do you think here @tgxworld?

Would changing this to include the trailing slash have helped?

DbHelper.remap("uploads/#{previous_db_name}/", "uploads/#{current_db_name}/")

That feels more correct in any case.

No - I’m talking about filenames like //discourse-cloud-file-uploads.s3.dualstack.us-west-2.amazonaws.com/standard11/uploads/DBNAME/original/1X/abcd.....jpg

So I think we’d need to exclude uploads starting with // or something like that.

We’re experiencing the pain of this quite abit while migrating local uploads to S3. Unfortunately, no amount of hacky remaps can fix this properly. What we’re currently working on at the moment is to remove/disallow upload links in Post#raw. The short upload url, ![test](upload://asdikajsdiasds.png), will be used as the public interface for access to uploads in raw.

Once that is done, we will only have to remap the urls in the uploads table and do a rebake of affected posts.

6 Likes

“only” have to do a rebake… yes, on one hand I think it’s good to get rid of the hacky remaps, on the other hand it seems like the backup/restore and migrate from/to S3 processes are getting more and more complicated and hard.

See also a number of issues in Migrate_from_s3 problems which are kind of related but they’re not all covered by your proposed solution.

I do not think so, the moves will get significantly simpler. Walk the uploads table and upload all files to new s3 bucket, rebake and you are done.

They do though become more expensive computation wise cause a rebake is required.

This is a cost I think is worth bearing given we get a correct move at the end of the process. But, we have the flexibility of still doing database replace :roller_coaster: as we do today for the cooked column and post revisions if that is how we roll, the key change in the process is that we leave posts.raw alone and then have 0% of corruption there. posts.raw and posts_uploads auto generated map will always be stable.

4 Likes

Ok - good point.

I think a significant improvement would be to walk the uploads table instead of posts (see the topic I linked above) and that would indeed leave the flexibility to replace in posts.cooked

Is this still an issue @sam?

upload:// links in raw markdown help this a lot. I am not sure if @gerhard amended it so we queue a complete rebake post restore. (I think we do)

1 Like

It depends. We do a rebake of all posts that contain uploads when optimized images aren’t included in backups (which is the default). Otherwise we rely only on the remaps.

2 Likes

We should probably change it to unconditionally rebake, its safer.

2 Likes

Sure, I just changed it. It will be included in the big refactoring of backups & restores.

5 Likes