Hi there, as I’ve mentioned in previous posts, I’m doing test runs of my Drupal → Discourse migration so I have all the solutions in place before I ultimately take down the old site to migrate the production data with its ~2M posts. What I’ve learned is that on a fairly fast VPS with 3 vCPU cores, the import process takes forever, somewhere around 48 hours. And then I’ll probably have to do some more cleanup with rake tasks and/or rails c, and anything that requires a rake posts:rebake will take approximately another 20 hours.
I don’t really understand the fundamentals of the Ruby toolchain. But if I throw more CPU cores at the job, will it significantly reduce the time any one of these processes takes to complete? For example, will a bundle or rake command divide its work among the available CPUs, or are the additional cores mainly useful for running multiple concurrent processes when multiple users are hitting the website?
@Canapin Actually thanks a lot for mentioning that, I would really like to know how you did it. I’ve been wanting to do the same thing, but I threw out the idea because I assumed I would run into database inconsistencies with a partial import. So I ended up creating a skeleton Drupal test forum to test on. But I’d prefer to test a copy of the production DB.
I’m mainly concerned about the eventual final production migration; I’ll have to take the old forum offline or at least make it read-only, and it’s looking like roughly 48 hours of downtime at best. Unless throwing double the CPU cores at it would cut that time in half?
The tasks that take a long time to restore are indeed multi-threaded. One caveat: doubling the CPUs almost never doubles the performance.
The other point is that rake posts:rebake and the rest of the heavy lifting the forum does to recover and optimize the content can usually happen with the forum live. That might reduce the time you need to keep the forum offline or read-only, since you can offer a somewhat degraded experience in the meantime.
My recommendation would be: first, test. Do the migration and see how much time it takes and how the forum looks without all the rebakes in place. If it’s good enough, time the migration to end around your forum’s low-traffic period; that way you gain some 4–10 hours of migration time without a lot of people complaining.
Unfortunately, I didn’t write that down and I’ve forgotten… But if you know how to code, it shouldn’t be very difficult.
I might have tweaked the BATCH_SIZE and offset values, among other things, to alter the loop and make it skip batches of posts, or something like that…
I can’t retry it now because I don’t have a forum to import at the moment, but I’ll write a quick tutorial next time, because I think it’s quite useful.
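The batch-skipping tweak described above can be sketched roughly like this. Everything here is an assumption about the general shape of a Discourse import script (they typically page through the source table in fixed-size batches), not the actual code; `SKIP_OFFSET` and `each_batch` are hypothetical names:

```ruby
# Hypothetical sketch: raising the starting offset makes the import loop
# skip earlier batches instead of starting from row 0.
BATCH_SIZE = 1000
SKIP_OFFSET = 50_000 # assumption: a hand-picked resume/skip point

def each_batch(total_rows)
  offset = SKIP_OFFSET
  while offset < total_rows
    # stand-in for a SQL page like: SELECT ... LIMIT BATCH_SIZE OFFSET offset
    rows = (offset...[offset + BATCH_SIZE, total_rows].min).to_a
    yield rows, offset
    offset += BATCH_SIZE
  end
end

batches = []
each_batch(52_500) { |rows, offset| batches << [offset, rows.size] }
# the first batch now starts at the skip offset rather than at 0
```

A real importer would also need to leave the already-imported users and categories in place, which is what makes partial re-imports possible at all.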
Thanks a lot Richard for the reply. So which one(s) of these?
Interesting, haven’t seen this recommendation before. Does that reduce fragmentation or something?
Yeah, I was initially going to fix some [QUOTE] problems and do the Textile → Markdown conversion with regexp_replace() in the Postgres console and then rebake all posts, because the rake posts:remap commands were just too slow. But then I discovered that the regex flavor Postgres uses isn’t PCRE-compatible, and there are just too many unexpected anomalies to rely on it. So I’m going to run the posts through Pandoc during the import process instead, which should let me get the imported site up and running in a presentable state, and then fix smaller stuff like emoji keywords with rake posts:remap.
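A per-post cleanup pass along these lines could run inside the importer instead of in Postgres. This is a minimal sketch, not the actual migration code: `clean_raw`, `textile_to_markdown`, and the specific [QUOTE] anomalies are assumptions, and Ruby’s regex engine supports the PCRE-style constructs (lookarounds, lazy quantifiers) that Postgres’ POSIX-flavored regexp_replace() lacks:

```ruby
require "open3"

# Hypothetical cleanup of two assumed quote-tag anomalies.
def clean_raw(raw)
  raw
    .gsub(/\[quote=([^\]"]+)\]/i, '[quote="\1"]')   # quote attributions Discourse expects
    .gsub(%r{\[/quote\]\s*(?=\S)}i, "[/quote]\n\n") # blank line after a closing quote tag
end

# Hypothetical Pandoc hand-off for the Textile -> Markdown step.
# Requires the pandoc binary on the PATH; -f/-t are pandoc's real
# input/output format flags.
def textile_to_markdown(raw)
  out, _status = Open3.capture2("pandoc", "-f", "textile", "-t", "gfm",
                                stdin_data: raw)
  out
end
```

Doing this in Ruby per batch keeps the fixes idempotent, so re-running the importer over already-clean text leaves it untouched.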
By any chance, could anybody share tips on the fastest way to re-run an import script and make it re-import the data? I’m tweaking some text substitutions in the importer script, and when I don’t get them right I have to delete the Discourse database and ./launcher rebuild import, which takes quite a while. I’d like to make changes to my importer script and have it start over from the beginning (I’m using a small skeleton mockup of my site’s database right now, so the importer itself runs very fast).
Hmmm. I’m testing another import of my production forum data, this time on a fairly powerful VPS with 8 virtual cores and 16 GB of RAM. I set: UNICORN_SIDEKIQS=4, DISCOURSE_SIDEKIQ_WORKERS=20, UNICORN_WORKERS=16.
With this it doesn’t seem to be taking advantage of all the cores during the import_topics stage:
I think that during the user creation stage there are more actions that are handled asynchronously by Sidekiq.
A large part of the import will unfortunately not benefit from parallelization; you should optimize for single-core CPU speed instead.
Theoretically you could run different chunks of the topics import in parallel, but it would require quite a bit of refactoring of the importer, plus making sure everything is processed in order. Not worth it for a one-off task with a few iterations.
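For what the chunked-parallel idea would look like, here is a rough sketch under stated assumptions: `chunk_ids` and `import_topic` are hypothetical names, and this deliberately ignores the hard parts mentioned above (per-process database connections, ordered post-processing):

```ruby
# Hypothetical: split the topic IDs into roughly equal chunks,
# one per worker process.
def chunk_ids(ids, workers)
  ids.each_slice((ids.size.to_f / workers).ceil).to_a
end

chunks = chunk_ids((1..10).to_a, 4)
# Each chunk could then go to a forked worker, e.g.:
#   chunks.each { |c| fork { c.each { |id| import_topic(id) } } }
#   Process.waitall
```

The forking part is exactly where the refactoring cost lives, which is why it isn’t worth it for a one-off migration.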
I followed a combination of these two guides for importing, with access to another Docker container running a copy of the source forum database in MySQL. But it dawned on me that instead of creating a separate import container, I can just use a single app container and add the mysql-dep.template to it:
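For reference, a sketch of what the templates section of the container definition might look like with that change. The exact file names are assumptions based on a standard discourse_docker checkout, so verify them against your own containers/app.yml:

```yaml
# containers/app.yml (sketch; template paths are assumptions)
templates:
  - "templates/postgres.template.yml"
  - "templates/redis.template.yml"
  - "templates/web.template.yml"
  - "templates/import/mysql-dep.template.yml" # MySQL client deps for the importer
```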
This lets me have a functioning Discourse instance while the importer script is running. Is there any disadvantage to opening the forum to the public as soon as all the users and categories are imported, and just letting users know with a banner that it will be a few days until it’s fully populated? I’m thinking that at the very least I could open it up after all the topics and posts are imported but before the private messages are, since the private messages alone will take a good 24 hours to import.