Bypass UploadCreator for Import

I’m working on an import to Discourse using a bulk importer. This works very well for topics and posts, but right now the slow part is files. We have about 50,000 users with avatars; the user data imports to the DB in just a few seconds, but the avatars are taking hours, at roughly one upload per second.

Is there any way to speed this up? I’m not sure which part of the process is slowest. If no avatar file is found (photo_filename doesn’t exist), the loop executes very quickly, but I’m getting a bit lost trying to dig into the UploadCreator class that is ultimately invoked by this importer code.

We have over 600,000 attachments, so I’m very concerned about how long those will take to import using the same create_upload call. This is the avatar import code in question:

        # create_upload wraps UploadCreator, which appears to be where the time goes
        upload = create_upload(u.id, photo_filename, File.basename(photo_filename))
        if upload.persisted?
          # briefly leave import mode so the user_avatar record is created normally
          u.import_mode = false
          u.create_user_avatar
          u.import_mode = true
          u.user_avatar.update(custom_upload_id: upload.id)
          u.update(uploaded_avatar_id: upload.id)
        else
          puts "Error: Upload did not persist for #{u.username} #{photo_filename}!"
        end
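
For reference, here is roughly the shortcut I have in mind: build the Upload record and hand the file to the store directly, skipping UploadCreator’s image inspection, optimization, and thumbnailing. This is only a sketch, assuming the files are already valid images of an allowed type; fast_create_upload is just an illustrative name, not something that exists in the importer.

    # Sketch only: create the Upload row directly and let the configured
    # FileStore copy the file into place. Skips UploadCreator's image
    # inspection, optimization, and thumbnailing (the Upload model's own
    # validations still run on save).
    def fast_create_upload(user_id, path, original_filename)
      upload = Upload.new(
        user_id: user_id,
        original_filename: original_filename,
        filesize: File.size(path),
        sha1: Digest::SHA1.file(path).hexdigest,
        extension: File.extname(path).delete_prefix(".")
      )

      # Discourse.store is the active FileStore (local or S3);
      # store_upload copies the file and returns its public URL.
      File.open(path) do |file|
        upload.url = Discourse.store.store_upload(file, upload)
      end

      upload.save!
      upload
    end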

Any ideas on this, @neounix, since you ran a big bulk import once?

Thanks to the bulk importer, we got 26 million posts down from a week to about 2 hours. The sore spot now is attachments, which take multiple days.

Hey @TheDarkWizard

I did not use the Discourse scripts to move over the actual files.

We used normal file transfer utilities: tar, gzip, sftp, rsync, etc.

Frankly speaking, we used various pieces of different Discourse migration scripts, but ended up writing more than half of the code we used for the migration. We spent months writing gsub() code to clean up decades of posts full of code, reviewed by moderators who had posted a lot of code over the years, and everyone wanted their code to come out perfect, with zero syntax issues!

We thought the scripts provided by Discourse were a great starting point and used them extensively, and we also wrote a lot of our own based on them.

HTH

I’m sorry, perhaps my question was missed. We don’t need instructions on how to move files into the server environment where the import is happening. We have a bulk importer script that @Ghan is writing, and we are trying to figure out how to make attachments import faster. Switching from the normal importer to a bulk importer took post imports from a week down to about two hours. I was hoping someone could point us in the right direction on how to properly handle attachments.

Sorry if I read your question wrong and my reply was not helpful.

Anyway, I am sure you can figure it out. It’s not rocket science (it’s only software) :slight_smile: and you guys are smart guys.

Best of luck. Sorry not to be more helpful. We completed our migration in 2Q 2020 and it (the migration task) is far in our rear-view mirror.


Fair enough!

Your site looks great :slight_smile:


I don’t think there is a similar silver bullet. Since the uploads don’t depend on previous posts being processed, you could run multiple processes (say, each handling a different date range) to cut the time by roughly a factor of the number of CPUs you can throw at it (assuming the database and file system aren’t the bottleneck).
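
A rough sketch of what I mean, splitting by user id modulo rather than date range. The worker count and the import_avatar_for helper are placeholders for however the avatar loop above ends up being factored:

    # Fan the avatar import out across several forked workers, each taking
    # its own slice of the users. Each child must establish its own DB
    # connection after the fork.
    WORKERS = 8

    pids = WORKERS.times.map do |i|
      fork do
        ActiveRecord::Base.establish_connection
        User.where("id % ? = ?", WORKERS, i).find_each do |u|
          import_avatar_for(u) # hypothetical wrapper around the create_upload block above
        end
      end
    end

    pids.each { |pid| Process.wait(pid) }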

It seems that as posts are processed for attachments, a number of Sidekiq jobs get spun up to handle other processing on those posts. As a result, even a single process working on attachment imports slowly drives the server to a load average of over 40, even with 8 cores. (I increased the number of Sidekiq workers to handle the load.)

I might be able to stop the unicorn service until the import is complete, but that is just shifting the load to a later time. It seems like the processing has to be done one way or the other.
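
A middle ground might be to quiet Sidekiq from a rails console instead of stopping unicorn outright, so jobs just accumulate in Redis during the import and get worked off afterwards. This uses the standard Sidekiq API and is only a sketch of the idea:

    require "sidekiq/api"

    # Ask every running Sidekiq process to stop fetching new jobs.
    # Already-enqueued jobs stay in Redis and are processed once the
    # workers are restarted (e.g. by restarting the container).
    Sidekiq::ProcessSet.new.each(&:quiet!)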

That is a fundamental truth.