Migrate_from_s3 problems

I have two things to address next. The first is that the existing limit is helpful, but it only limits the search space; I also want to be able to specify a maximum number of posts to actually modify. So I'll add a max number of posts to modify alongside the query limit.

Also, when a max is specified, it makes sense to be verbose about what is being processed, for debugging purposes, so I'd like to make the output verbose whenever max is not nil. This lets people validate the process before continuing, which is the primary use case for this work.

I think what I'll do is specify max as the first argument and an optional limit as the second, because the max is really the important one; limit is just to make "one ping only" cheaper.
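Concretely, the interface I'm picturing looks something like this; it's only a sketch, and the task body and variable names here are illustrative, not the final code:

```ruby
# Illustrative sketch only: max first, optional limit second.
task "uploads:batch_migrate_from_s3", [:max, :limit] => :environment do |_task, args|
  max   = args[:max]&.to_i   # at most this many posts get modified
  limit = args[:limit]&.to_i # optional cap on the query's search space
  verbose = !max.nil?        # chatty output whenever a max is given, for validation runs
  # ... query up to `limit` candidate posts, migrate at most `max` of them ...
end
```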

The second thing is non-`upload:` uploads. A bit over a year ago, I tried uploading a video to the Discourse site I was joining, while floundering around trying to figure out how to write the migration from Google+ to Discourse, and saw that what was written was `https://#{SiteSettings.absolute_base_url}/original/3X/b/a/ba9e06ebc2f4397f26793bb5cd4e169308dd371d.mp4`

Today, when I upload a video, I get something like `![file_example_MP4_480_1_5MG|video](upload://caJ9ykkpshw3PFK4464VUIPWJ4l.mp4)` instead.

At least in my most recent test, `migrate_from_s3` completely mangled those URLs, leaving them not even valid URLs any more, so that definitely needs to be fixed. Then, since I think this task is unlikely in practice to run into `![text](non-upload-url-to-migrate)`, as a first pass I'd like to just add the link markdown around the magic upload protocol (something like the sketch below), rather than making the regex handle both cases and become even harder to read. But I might change my mind on that.
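As a rough illustration of that first pass, where the pattern and helper name are my assumptions rather than the actual task code:

```ruby
# Wrap a bare upload:// reference in link markdown so it survives as a
# working link; deliberately ignores the ![text](non-upload-url) case.
def wrap_bare_upload_refs(raw)
  raw.gsub(%r{(?<!\()upload://([0-9a-zA-Z]+\.\w+)}) do |url|
    "[#{Regexp.last_match(1)}](#{url})"
  end
end
```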

It looks like the video or audio tag is added in JavaScript by regex match, so I'll have to copy the regexes from app/assets/javascripts/discourse/app/lib/uploads.js into the task to identify them properly. I'll include the source of the regexes so the next person to find them knows where to update them from. :stuck_out_tongue:
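Something like the following, where the extension lists are my guesses; re-copy the real ones from uploads.js rather than trusting these:

```ruby
# Regexes to classify uploads as video/audio, copied by hand from
# app/assets/javascripts/discourse/app/lib/uploads.js -- if they drift,
# update them from there. Extension lists here are illustrative.
VIDEO_REGEX = /\.(mov|mp4|webm|m4v|ogv|avi|mpe?g)\z/i
AUDIO_REGEX = /\.(mp3|ogg|wav|m4a)\z/i

# Short-URL markdown uses a |video or |audio suffix to get the right tag.
def media_markdown_suffix(filename)
  return "|video" if filename.match?(VIDEO_REGEX)
  return "|audio" if filename.match?(AUDIO_REGEX)
  ""
end
```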

Tonight, I made some time to work on this, and I have a draft PR underway. It is not finished yet; I know there are bugs left in it. I don't think I've changed any behavior for `upload:` pseudo-protocol URLs (the normal form for images) at this point, though I did add a sanity check.

https://github.com/discourse/discourse/pull/10093

With the changes in this PR, I've successfully migrated both normal `upload:` pseudo-protocol uploads and videos that are referenced by explicit S3 URLs (well, Digital Ocean Spaces in my case). I'm modifying only one post at a time with this command:

```
bin/rake uploads:batch_migrate_from_s3[1,1000]
```

Note that this will not go beyond the first 1000 posts, returned somewhat randomly by the database query; the low limit just keeps the query fast while I migrate one post at a time, review that post for correct behavior, and then start over to find another one. This command works this way only with the PR I'm currently working on!

I'm continuing to add diagnostic output as I work on this, and I'm starting to think it is important for more than development. I'm seeing a lot of transient download failures from Digital Ocean Spaces, where some but not all of the downloads in a post get migrated; the original task will just print a `.` for each, keep going, and then say `Done`, but the task really won't be done. I had to make something like five or six passes on one post before it migrated all the files. (I wasn't counting, because at first I thought I was debugging a local bug.) I expect to have to run this migration repeatedly with the same limit until the diagnostics are clean. Therefore, I'm making verbose progress print only if max is set, but printing useful warning messages either way.
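The split is roughly this, with method names that are mine, for illustration:

```ruby
# Verbose progress only for validation runs (max given); warnings always,
# so partial failures can't hide behind a stream of dots and a final Done.
def log_progress(message)
  puts message if @max
end

def log_warning(message)
  puts "WARNING: #{message}"
end
```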

Right now, I'm using the following hack for Digital Ocean Spaces' intermittently failing downloads, which in practice has improved my success rate tremendously (3 retries have been entirely sufficient across hundreds of migrated posts so far); a sketch of the idea follows the commit link.

https://github.com/johnsonm/discourse/commit/7dfac12a2ea6ec04ba4e0616b4e0dbd1d806cff7
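The shape of it is just a bounded retry around the download. This assumes the Discourse environment (FileHelper); the method name and rescue list below are placeholders, and the commit above has the real thing:

```ruby
# Retry transient download failures a few times before giving up; three
# retries has been enough in practice across hundreds of migrated posts.
def download_with_retries(url, max_file_size:, tries: 3)
  attempt = 0
  begin
    attempt += 1
    FileHelper.download(url, max_file_size: max_file_size, tmp_file_name: "from_s3")
  rescue OpenURI::HTTPError, Net::ReadTimeout => e
    if attempt <= tries
      puts "WARNING: retrying #{url} after #{e.class}"
      retry
    end
    raise
  end
end
```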

Also, I discovered that we somehow ended up with videos larger than the limit I set when I imported from Google+. I've had to temporarily increase both `SiteSetting.max_image_size_kb` and `SiteSetting.max_attachment_size_kb` while doing one-shot migrations of a few oversized videos; it's not clear to me how they ended up on the site, but I don't want to break them now… No idea whether the bug that allowed oversized uploads was in my import script, in Discourse, or merely in my memory of what changes I made to settings over time. :wink:
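"Temporarily increase" here means something like this in the rails console, with example values; remember to restore the originals afterwards:

```ruby
# Save the current limits, raise them for the one-shot migration, restore.
old_image      = SiteSetting.max_image_size_kb
old_attachment = SiteSetting.max_attachment_size_kb
SiteSetting.max_image_size_kb      = 102_400 # 100 MB, example value only
SiteSetting.max_attachment_size_kb = 102_400
# ... run the one-post migration for the oversized videos ...
SiteSetting.max_image_size_kb      = old_image
SiteSetting.max_attachment_size_kb = old_attachment
```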

Because much of what I'm migrating was imported from G+, some of my posts now fail current validations. I got a few `Unhandled failure: Validation failed: Sorry, new users can only put one image in a post` errors and didn't initially understand why they didn't recur. It turns out that the uploads were successfully moved to local storage, and since they all used the `upload:` pseudo-protocol, the raw didn't change. However, `post.save!` still failed validation with that error, which prevented `post.rebake!` from firing, so I have a few posts out of 30K with images that need to be rebaked; sadly, I have no record of which posts those are. I have now switched to `post.save!(validate: false)`, so this particular problem should not recur. I'm very glad I made the migration bail out on unhandled errors, though, or this could have done a lot more damage than a few posts.
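So the fix at the point where a migrated post is saved is just this (a sketch; the surrounding error handling lives in the PR):

```ruby
# Skip validations: a years-old imported post may no longer pass rules
# like "new users can only put one image in a post", and this migration
# isn't changing its content in any way that should re-trigger them.
post.save!(validate: false)
post.rebake!
```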

To keep my site usable, including delivering notifications, while the migration runs, I don't want to spam the sidekiq queues. I know naming things is one of the two hardest problems in computer science, along with cache invalidation and off-by-one errors, but I'm proposing `DISCOURSE_MIGRATION_MAX_ENQUEUED` as an environment variable: the total number of queue slots (not job slots) that are allowed to be filled before proceeding to migrate another item after a rebake. That keeps the queues from being spammed, so the site continues to function. I have a patch that adds this, defaulting to zero, for all the per-post rebaking in lib/tasks/uploads.rake, and I'm using it in my production migration.

https://github.com/discourse/discourse/blob/59a761851b9c8786d3a9692f8c595372b0534f77/lib/tasks/uploads.rake
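A minimal sketch of the back-pressure check, assuming the environment variable described above; the helper name is mine:

```ruby
require "sidekiq/api"

# Total jobs allowed to be sitting in all queues before we proceed to the
# next post; defaults to zero, i.e. wait for the queues to drain completely.
MAX_ENQUEUED = ENV.fetch("DISCOURSE_MIGRATION_MAX_ENQUEUED", "0").to_i

def wait_for_sidekiq
  # Sidekiq::Stats#enqueued counts queue slots across all queues,
  # not slots per job class.
  sleep 5 while Sidekiq::Stats.new.enqueued > MAX_ENQUEUED
end
```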
