Migrate_from_s3 problems

When migrating from S3 to local storage, we see a number of issues.

The main issue is that the migrate_from_s3 rake task does not take the uploads table as its starting point, but the posts table. This causes it to skip a lot of uploads, which are then left behind on S3:

  • uploads used as logos, because they are not referenced in a post
  • uploads used as avatars, because they are not referenced in a post
  • uploads that are (for some reason) referenced by their CDN URL in raw, because they do not match the regex used to identify uploads
  • uploads on non-AWS S3 storage, because the regex requires amazonaws in the URL
  • uploads that for some reason do not match the second, different regex (we're seeing this with non-image uploads like mp3 files; not sure why this happens)
2 Likes

We're working on improving the way we associate uploads with posts (by using the upload:// scheme), which will make these storage migrations much more bulletproof. There's little point in fixing these rake tasks before that's done.

7 Likes

I recently opened a new PR to allow migrating fewer posts at a time (such as testing one at a time) and added tests for the task.

https://github.com/discourse/discourse/pull/9933

It feels like it makes sense to start with posts, since you want to rebake each post to change the URLs in its cooked version after each migration; migrating all the uploads and only then starting the rebakes would leave the site very broken for a potentially long time on a large site (I need to migrate about 100GB from S3 to local, so I care). But what I wrote might help start writing a migrate_uploads_from_s3 task that would be run after migrate_from_s3 in order to clean up uploads that weren't part of posts.

@zogstrip What’s the current status of “We’re working on…” — is this still in flux or is this migration now worth paying attention to?

5 Likes

We’ve done lots of work on the uploads front to ensure we have better posts <=> uploads associations.

These migration tasks can be improved at will :+1:

2 Likes

I'll find out whether it's worth my time looking into non-post uploads after I am able to use the batch migration to move my site over during off hours! :slight_smile: Thanks for the PR review!

1 Like

I have two things to address next. The first is that the limit is helpful, but it only bounds the search space; I also want to be able to cap the number of posts actually modified. So I'll add a max number of posts to modify as well as a limit for the query.

Also, when a max is specified, it makes sense to be verbose about what is being processed for debugging purposes, so I'd like to make the output verbose whenever max is not nil. That will let people validate the process before continuing, which is the primary use case for this work.

I think what I'll do is specify max as the first argument and an optional limit as the second, because the max is really the most important thing; limit is just to make "one ping only" cheaper.
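Here's a minimal sketch of the argument handling I have in mind (not the PR itself; the argument order is as described above, everything else is assumed):

task "uploads:batch_migrate_from_s3", [:max, :limit] => :environment do |_, args|
  max     = args[:max]&.to_i    # stop after actually modifying this many posts
  limit   = args[:limit]&.to_i  # only bounds the query's search space
  verbose = !max.nil?           # print per-post detail so a small run can be validated
  # ... query up to `limit` candidate posts, migrate their uploads and rebake,
  #     stopping once `max` posts have actually been modified ...
end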

The second is uploads that don't use the upload: pseudo-protocol. A bit over a year ago, while floundering around trying to figure out how to write the migration from Google+ to Discourse, I uploaded a video to the Discourse site I was joining, and saw that what got written into the raw was https://#{SiteSettings.absolute_base_url}/original/3X/b/a/ba9e06ebc2f4397f26793bb5cd4e169308dd371d.mp4

Today, when I upload a video, I get something like ![file_example_MP4_480_1_5MG|video](upload://caJ9ykkpshw3PFK4464VUIPWJ4l.mp4) instead.

At least in my most recent test, migrate_from_s3 completely mangled those URLs, leaving them not even valid URLs any more, so that definitely needs to be fixed. I also think this task is unlikely in practice to run into ![text](non-upload-url-to-migrate), so as a first pass I'd like to just add the link markdown around the magic upload protocol, rather than making the regex handle both cases and become even harder to read as a result. But I might change my mind on that.

It looks like the video or audio tag is added in javascript by regex match, so I’ll have to copy the regexes in from app/assets/javascripts/discourse/app/lib/uploads.js into the task to identify them properly. I’ll include the source of the regexes so the next person to find them knows where to update them from. :stuck_out_tongue:
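Roughly the shape of the idea, with the extension lists assumed rather than copied from uploads.js (the real lists should come from that file, as noted above):

# Illustrative only; copy the real extension lists from
# app/assets/javascripts/discourse/app/lib/uploads.js.
VIDEO_EXTENSIONS = %w[mp4 mov webm ogv]  # assumed
AUDIO_EXTENSIONS = %w[mp3 ogg wav m4a]   # assumed

# Build the markdown for a migrated upload, e.g.
# "![clip|video](upload://caJ9ykkpshw3PFK4464VUIPWJ4l.mp4)"
def media_markdown(upload)
  ext = File.extname(upload.original_filename).delete(".").downcase
  tag =
    if VIDEO_EXTENSIONS.include?(ext)
      "|video"
    elsif AUDIO_EXTENSIONS.include?(ext)
      "|audio"
    else
      ""
    end
  "![#{File.basename(upload.original_filename, ".*")}#{tag}](#{upload.short_url})"
end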

Tonight, I made some time to work on this, and I have a draft PR underway. It is not finished yet; I know there are bugs left in it. I don’t think I’ve changed any behavior for upload: pseudo-protocol URLs (normal for images) at this point, though I added a sanity check.

https://github.com/discourse/discourse/pull/10093

With the changes in this PR, I've successfully migrated both normal upload: pseudo-protocol uploads and videos that are currently referenced explicitly by their S3 URL (well, Digital Ocean Spaces in my case). I'm modifying only one post at a time with this command:

bin/rake uploads:batch_migrate_from_s3[1,1000]

Note that this will not go beyond the first 1000 posts returned somewhat randomly from the database query; the low limit is just to keep the query fast while migrating one post at a time, reviewing that post for correct behavior, and then starting over to find another one. This command works this way only with the PR I'm currently working on!

I'm continuing to add diagnostic output as I work on this, and I am starting to think it is important for more than development. I'm seeing a lot of transient download failures from Digital Ocean Spaces where some, but not all, of the downloads in a post get migrated; the original code just prints a . and keeps going, then says Done even though the task really isn't done. I had to make something like five or six passes on one post before it migrated all the files. (I wasn't counting, because at first I thought I was debugging a local bug.) I expect to have to run this migration repeatedly with the same limit until the diagnostics are clean. Therefore, I'm making verbose progress print only if max is set, but printing useful warning messages either way.

Right now, I'm using the following hack for the intermittently failing downloads from Digital Ocean Spaces, which in practice has improved my success rate tremendously (three retries have been entirely sufficient across hundreds of migrated posts so far).

https://github.com/johnsonm/discourse/commit/7dfac12a2ea6ec04ba4e0616b4e0dbd1d806cff7
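The general shape of it, if you don't want to read the commit, is just a retry wrapper around the flaky download (names here are assumed; the commit is the real thing):

require 'open-uri'
require 'net/http'

# Retry a flaky object-store download a few times before giving up.
def with_download_retries(tries: 3)
  attempt = 0
  begin
    attempt += 1
    yield
  rescue OpenURI::HTTPError, Net::ReadTimeout, SocketError
    if attempt < tries
      sleep attempt # simple backoff before retrying
      retry
    end
    raise
  end
end

# e.g. file = with_download_retries { FileHelper.download(url, max_file_size: max_file_size, tmp_file_name: "from_s3") }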

Also, I discovered that somehow we ended up with videos larger than the limit I set when I imported from Google+. I have had to temporarily increase both SiteSetting.max_image_size_kb and SiteSetting.max_attachment_size_kb while doing one-shot migrations of a few oversized videos; it's not clear to me how they ended up on the site, but I don't want to break them now. I have no idea whether the bug that allowed oversized uploads was in my import script, in Discourse, or merely in my memory of what settings changes I made over time. :wink:
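For reference, the temporary bump was just a couple of rails console assignments along these lines (the values here are made up; remember to put the originals back afterwards):

SiteSetting.max_image_size_kb = 200_000      # temporarily raised for the oversized videos
SiteSetting.max_attachment_size_kb = 200_000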

Because much of what I'm migrating was an import from G+, some of my posts ended up failing current validations. I got a few "Unhandled failure: Validation failed: Sorry, new users can only put one image in a post" errors and didn't initially understand why they didn't recur. It turns out that the uploads were successfully moved to local, and they all used the upload: pseudo-protocol, so the raw didn't change. However, post.save! still failed validation with that error, which prevented post.rebake! from firing, so I have a few posts out of 30K with images that need to be rebaked; sadly, I have no record of which posts those are. I have now switched to post.save!(validate: false) as another fix, so this particular problem should not recur. I'm very glad I made the migration start bailing out on unhandled errors, though, or this could have done a lot more damage than a few posts.

To keep my site usable, including delivering notifications, while running the migration, I don't want to spam the sidekiq queues. I know naming things is one of the two hardest problems in computer science, along with cache invalidation and off-by-one errors, but I'm proposing DISCOURSE_MIGRATION_MAX_ENQUEUED as an environment variable controlling how many total queue slots (not job slots) are allowed to be filled before the migration proceeds to the next item after a rebake, so that the site continues to function. I have a patch that adds this, defaulting to zero, for all the per-post rebaking in lib/tasks/uploads.rake. I am using this on my production migration.

https://github.com/discourse/discourse/blob/59a761851b9c8786d3a9692f8c595372b0534f77/lib/tasks/uploads.rake
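The throttling itself boils down to waiting for Sidekiq to drain before moving on; a rough sketch of the idea (the environment variable name is the one proposed above, everything else is assumed):

require "sidekiq/api"

# Block until the total number of enqueued Sidekiq jobs drops to the allowed level.
def wait_for_sidekiq_to_drain
  max_enqueued = ENV.fetch("DISCOURSE_MIGRATION_MAX_ENQUEUED", "0").to_i
  sleep 5 while Sidekiq::Stats.new.enqueued > max_enqueued
end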

4 Likes

@zogstrip Do you mind reviewing this PR since you have context from recent review on my last one in this area? https://github.com/discourse/discourse/pull/10093

I’ve included in it the fixes I’ve been doing while running this relatively large migration. I haven’t tried to add tests for every fix; I’m not sure how to inject every form of error. But at least the new functionality is tested.

@RGJ I think that my PR as it stands now might deal with all but your first two points, except that I'm not sure about the CDN case. My site used a CDN and the migration handled videos that had CDN URLs, but that might have been a side effect of how Digital Ocean Spaces names things. If you have additional cases, hopefully my PR will give you an easy scaffolding for adding to the regex and adding test cases for the additional variations.

I think that it is right to migrate by post first, because after migrating the uploads in a post the post needs to be rebaked so that the cooked post has the right URLs in it. After I finish migrating my posts (which might take less than two weeks now that I've changed my rate-limiting to check the queue length directly), I'll be looking at whatever work remains for cleanup.

Because multiple posts can share references to the same content if more than one person uploads the same file, there needs to be a second pass that checks the cooked data for old URLs and rebakes those posts to pick up the new locations. It can use the same rate-limiting to avoid soaking the queues.
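A sketch of what that second pass could look like (the search string is a placeholder for whatever uniquely identifies the old bucket's URLs):

# Rebake every post whose cooked HTML still references the old bucket.
Post.where("cooked LIKE ?", "%UNIQUEPARTOFS3URL%").find_each do |post|
  post.rebake!
end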

I probably won't see any broken logos at MakerForums, because we tweaked branding after we stopped putting new content in "S3" (for us, Digital Ocean Spaces), but I probably will see a bunch of uploads still on S3 for avatars at least. Migrating uploads not associated with a post should be started only after all posts have been migrated, and I'll probably have to write that after I finish migrating uploads in posts.

@pfaffman I see Bizarre Problems with migrate_from_s3 - #5 which describes errors that didn’t repeat. Without my fixes in the current PR, errors are silently eaten, including validation failures. I think work here will address at least some of the problems you saw then.

@hosna the problems you raise in What does rake uploads:migrate_from_s3 exactly do? are either partially or completely solved so far in this PR. If they aren’t completely solved, I have added tests that will make it easier to add further tests to validate additional fixes.

@sam Since you put the 2.6 label on the PR, I'm assuming it won't get merged for at least a few days; should I pull my rate-limit feature work into the PR along with the fixes? Or would you prefer to keep fixes and feature work in separate PRs? I can do it either way. The rate limiting feature is working really well; I'm migrating about three times as fast, without affecting site availability, now that I'm waiting for the sidekiq queue to drain, so it makes sense to pull it in, I think, if that's something normally accepted in PRs. Otherwise, the rate-limiting work has to wait on the PR it's based on being merged, so either way it would be good to hear.

I DRYed out my migration rate limit patch and pulled it into the PR. It’s working in practice, and sar tells me that I’m seeing nearly zero idle time continuously, while the site continues to function, during the live migration. One benefit of the batch mode is that I can check for new Discourse versions after each full batch of migrations; I upgraded my site to 2.6.0beta1 at the first opportunity after it was cut, and have been running the migrations successfully on 2.6.0beta1 with my migration PR on top since the update.

I think that the PR is ready for review now; I’ll be planning to submit another PR for the last few stages, but getting this in place will improve the general migration experience for everyone even before I finish out the last pieces.

5 Likes

Well, that was a long time ago. . .

I'm anxious to see this worked out. Right now I've got several sites on multisite that need to pull images from S3 (I think some images are local and others are on S3) and then push them back to S3.

1 Like

@pfaffman It's probably a good thing that when I started my migration project I didn't know what I've since figured out. It looks like if I had used the minio client to copy the whole S3 bucket into the local uploads folder, set every Upload.url to nil in the rails console, and rebaked the site, everything would have been migrated in hours without regenerating all the images. (Instead, I'm rate-limited by running all the image conversions again, as if local CPU were cheaper than just copying all the processed images from S3.)

And then, if it had been that easy, I wouldn’t have done any of this work to make migrations more reliable in general, and no one else would have benefited. :slight_smile:

4 Likes

Oh. That sounds like what I want to know. Maybe I’ll give that a shot.

1 Like

I would… definitely take a backup before trying that route, since I’m just speculating. I don’t want to mislead you.

Oh, and one more thing: that would have completely failed for audio and video uploads, and I wouldn't have noticed until later, and then I would have been trying to figure it out and write custom code. So if you have audio and video uploads, you definitely want to start there, and it won't work right without the PR that's open, which won't merge until after the 2.5 release since @sam tagged it 2.6.

2 Likes

My apologies for joining the conversation this late. I just looked at your PR and I am wondering why you are recreating the Upload records instead of just changing their URLs? And why not just iterate through the OptimizedImages and do the same?

2 Likes

@RGJ I didn’t change that at all; recreating Upload was a bug fix from @zogstrip in Cannot execute the rake uploads:migrate_from_s3 - #11

I'm just trying to make the code that was already there work, and I don't know Discourse internals well; I've been stumbling around in the dark a lot. My only Ruby experience is the few PRs I've done for Discourse. Following the pattern of the existing code does not seem to be the most efficient route (see my conversation with @pfaffman above about short-circuiting); I completely agree. As you can tell from the fact that, earlier today, I didn't even realize that OptimizedImage.url would also need to be changed to a /uploads path and its etag set to nil (and I don't know what else), I'm still flying blind.

Still need to iterate across posts first at least for fixing up old literal URLs in posts. Still need some of the other fixes, like not revalidating posts and not silently eating errors. Still want to rate limit, I think, to limit impact on live sites.

For your first two concerns, which involve non-post uploads, here's my work in progress (commit) that might help; it's not yet tested on a live site, and I won't get to testing it on my live site until all the post upload migration is finished.

Tweaking something that used to work was all I had appetite for. If you'd like to do a faster migration, I'm all for it; it might make sense to merge my slower-but-at-least-better work as a Pareto improvement, and then you could completely replace it with something much better and I'd be the first to celebrate, even if I'm no longer in a position to use it at that point.

1 Like

@mcdanlj Thank you for your explanation. I was sincerely asking a question though, not implying that there are any issues now. I don't know if the way I suggested is any "better" at all; maybe it would introduce a lot of new issues. I assume that the code as it is now, no matter who wrote it, is the way it is for a reason.

3 Likes

@RGJ Likewise, I sincerely mean that if anyone knows with confidence how to avoid the boondoggle of rebuilding all the images, I'm all for it; if I were an experienced Discourse developer and knew how to do it easily, I'd likely have done that in the first place.

It's my guess that migrating away from S3 is not a common case for sites with lots of images, so it's hard for it to be worth the time relative to adding more generally useful features to Discourse. I had approximately 100GB of uploads from importing from Google+ into MakerForums, and we chose S3 on the theory that lots of people might move from those communities to MakerForums; in the end only a few communities moved actively, so continued growth didn't justify staying on S3 after all. I expect that's a fairly extreme outlier, and reprocessing the images is a one-time cost.

Finally, I’m very grateful for you opening this topic in the first place, because otherwise I could very easily have missed non-post uploads.

@RGJ You are clearly right that destroy-and-recreate isn't good. I added a check, raise "Error: upload url #{url} changed to #{new_upload.short_url}, should be unchanged." if url != new_upload.short_url, which at least reports problems with a corrupted source. Today's Digital Ocean Spaces outage caused it to send me corrupt data, and that tripped this error for many uploads; I don't know how many. But because the old upload had already been destroyed, I now have some corrupted pages, and I didn't realize it until I had lost the scrollback with the logs saying which pages are corrupted, so I can't even go in and fix them by hand from a backup.

So while my work is an improvement over the old way in that it at least reports some errors, it would be much better to download the files, check the SHA1s of the downloaded files, and write them to local files. In the meantime, I have modified my PR to stop the migration on any unhandled error to at least limit data loss.
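Something along these lines would do it; this is only an illustration of the check I mean, not code from the PR:

require "digest"

# Verify a downloaded file against the sha1 already recorded on the Upload row
# before replacing anything locally.
def verify_download!(upload, tempfile)
  actual = Digest::SHA1.file(tempfile.path).hexdigest
  if actual != upload.sha1
    raise "Corrupt download for upload #{upload.id}: expected #{upload.sha1}, got #{actual}"
  end
end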

I still think that my work should be merged as a Pareto improvement, since it doesn't make anything worse and does make things better, but it would be better still to just do it the right way. I think that would mean writing lib/file_store/from_s3_migration.rb to parallel lib/file_store/to_s3_migration.rb, but I'm not up to it.

Since my PR hasn't been reviewed yet, I've piled more onto it. I added an uploads:report_missing_uploads task that looks through raw for instances of upload://... where the upload object isn't present. At least in my case, I'll be able to look through my backups, find the orphaned files, restore them to the site, and then rebake those posts to bring back the images that went missing. I have found 678 of them to search for in backups, of which I have so far found all but 10, so I'm glad I wrote the test! I'll need to deal with that before I can go into the next phase, in which I rebake the remaining posts.
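The scan itself is conceptually simple; here is a rough sketch of the kind of thing it does (Upload.sha1_from_short_url is my assumption about the helper to use; the real task is in the PR):

require "set"

# Report upload:// references in raw that have no matching Upload record.
missing = Set.new
Post.where("raw LIKE '%upload://%'").find_each do |post|
  post.raw.scan(%r{upload://[0-9a-zA-Z]+\.[0-9a-z]+}).each do |short_url|
    sha1 = Upload.sha1_from_short_url(short_url)
    missing << short_url unless sha1 && Upload.exists?(sha1: sha1)
  end
end
puts "#{missing.size} missing uploads"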

I have completed my first phase of migration, not including rebaking affected posts with shared uploads, and not including non-post uploads. After I complete another remote backup, I plan to test out those features too and add them to my PR.

I have now started testing the next phases of migration, rebaking and non-post-upload migration. My first lesson was that rebaking as the next step fails due to non-post-upload avatars in quoted posts, so rebaking posts needs to be the last step.

Next discovery: Thank you foreign key constraints! I discovered while migrating non-post uploads that I need to migrate at least profiles before migrating non-post uploads generally.

ActiveRecord::InvalidForeignKey: PG::ForeignKeyViolation: ERROR:  update or delete on table "uploads" violates foreign key constraint "fk_rails_1d362f2e97" on table "user_profiles"

Profile migration was tricky because I had to migrate two uploads attached to each profile, and in some cases people used the same image for both, so I had to do the work in three stages. I have successfully migrated all the profiles on my site that had S3 URLs.
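The ordering point is the important part: every user_profiles reference has to be repointed at the new local upload before the old one is destroyed, or the foreign key above fires. Very roughly, with the column names assumed (they may differ by Discourse version):

# Sketch only; column names are assumptions, not verified against the schema.
def repoint_profile_uploads(old_upload, new_upload)
  UserProfile.where(profile_background_upload_id: old_upload.id)
             .update_all(profile_background_upload_id: new_upload.id)
  UserProfile.where(card_background_upload_id: old_upload.id)
             .update_all(card_background_upload_id: new_upload.id)
  old_upload.destroy!
end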

I have migrated all of my non-post, non-profile uploads. @RGJ if you want to try out the branch I posted, it is up to date with my work. Everything in it has now been used for a completed site migration.

@CvX I don’t know when you’ll have a chance to review, but without my PR, migrate_from_s3 definitely is full of silent data destruction bugs. What I’ve done isn’t perfect, but it does protect against a lot of bugs that I’ve run into in practice.

I’ve done all the work I currently intend to do here. I have migrated my site, which means that I can’t even effectively evaluate any requested changes. If you reject this PR, I strongly suggest that you instead delete the existing migration because it will silently destroy data for users.

With respect to the PR, I sometimes get test failures in CI (as does the current version), something like this:

7867 examples, 0 failures, 11 pending, 1 error occurred outside of examples

##[error]Process completed with exit code 1.

This doesn't reproduce for me locally (so far), and I see no information in the CI logs saying what is wrong in CI, so I'm not going to waste more of my time screwing around exploring CI output for hidden information.

One last note: after completing the migration, we discovered that some users' profile pictures were lost in the migration, and they are restoring them manually. I'm guessing additional code would have been required to migrate those successfully, but since I didn't figure that out until it was too late, I don't have a test case to validate it. So that's a remaining bug that might bite in some configurations, but I'm no longer in a position to write that additional fix.

5 Likes

Since my PR fixing the silent data corruption that is migrate_from_s3 never got so much as a glance, can migrate_to_s3 at least be documented as a one-way street?

Right now, it’s frankly a trap for young players.

@CvX @zogstrip What’s your preference? Look at the PR or just remove migrate_from_s3 entirely and be honest about the fact that it’s a one-way street?

5 Likes

Sorry, I’m at fault here. I’ll review the PR this week.

7 Likes

The review was basically that the current approach is so completely wrong that it needs to be rebuilt from scratch in a different way, and that fixing it in the style it was previously written is not OK. I am not up for that, so I really think migrate_from_s3 should be removed entirely and migrate_to_s3 marked as a one-way door that you had better really mean to go through forever. Right now, having migrate_from_s3 in the source at all is a data integrity fault in Discourse.

I didn’t get timely review while I was working on the migration process, and now that I have long since migrated (with some broken results, like some broken avatars), I am without a meaningful test bed. I have zero confidence that I would do it right, so I have bowed out of this fight. It’s up to the next poor schmuck who thought it would be a good idea to put images in some S3-alike object storage service to resolve the problem. Sorry!

2 Likes

@CvX Since you rejected the previous migration as done entirely incorrectly and needing to be started over, please review and merge this removal of the data corruption bug:

https://github.com/discourse/discourse/pull/11703

2 Likes

As has been noted, I ran into some problems with the migration I did. This supports @CvX's point that I wrote the migration completely wrong, and until someone does it right, the removal of the existing migration, full of silent data corruption, really should be merged. (It's a really simple patch; review should be trivial. The sooner it is merged, the less likely someone is to destroy their site trying to walk backwards through a one-way door. I'm not sure how to plead more strongly for a merge; if I knew how, I would. Edit: Thanks @CvX for the approval, merge, and follow-up to make sure that mass migration is marked as irreversible.)

Here are known problems I ran into:

  • At least the discobot avatar didn’t show; all discobot participation showed the generic person avatar (light gray head-and-shoulders on white background).
  • Some other people’s avatars were not migrated correctly. Many were fixed manually.
  • When the site was being configured, the original administrator experimented with changes to the "S3" configuration (in our case, Digital Ocean Spaces, with and without CDN), leading to some orphaned optimized images that didn't get fixed in the migration.

For anyone who tried using my branch and had some images go away like this: I ran a hack in the rails console and it fixed some things. This is not the right way to do this; I just don't know the model well enough to do it right. My Ruby-fu is very weak; I don't touch Ruby outside of my small contributions to Discourse.

But, such as it is, this terrible hack fixed broken images (from the initial configuration mistakes) and at least brought back the discobot avatar (and possibly others, but all of those of which I’d previously been aware had already been repaired).

What doesn't show here is everything I did between steps, looking at each set as I went. This was part of an "explore what is there in the rails console" session, not a script I wrote and ran. It has no error handling, and I took a complete site backup before I started doing any of this work.

I don't know how to say more strongly that this code is not the idiomatic Ruby on Rails way of doing this repair. I just wanted to share how I repaired at least some of the damage I did to myself in my previous attempt to migrate my site off Digital Ocean Spaces.

require 'set'

# Collect every optimized image whose URL still points at the old "S3" bucket,
# along with the ids of the uploads they belong to.
uploadids = Set.new
optimizedimages = Set.new
OptimizedImage.where("url like '%UNIQUEPARTOFS3URL%'").each do |oi|
  uploadids.add(oi.upload_id)
  optimizedimages.add(oi)
end

# Find every post that references one of those uploads.
postids = Set.new
uploadids.each do |u|
  PostUpload.where(upload_id: u).each do |pu|
    postids.add(pu.post_id)
  end
end

# Delete the stale optimized images; rebaking will regenerate them locally.
optimizedimages.each do |oi|
  oi.delete
end

# Rebake the affected posts so their cooked HTML picks up the regenerated images.
postids.each do |pid|
  Post.where(id: pid).each do |p|
    p.rebake!
  end
end
2 Likes