Ruby multi-CPU threading

Hi there, as I’ve mentioned in previous posts, I’m doing test runs of my Drupal → Discourse migration to have all the solutions in place before I ultimately take down the old site to migrate the production data with its ~2M posts. What I’ve learned is that on a fairly fast VPS with 3 vCPU cores, the import process takes forever, somewhere around 48 hours. And then I’ll probably have to do some more cleanup with rake tasks and/or rails c, and anything that requires a rake posts:rebake will take roughly another 20 hours.

I don’t really understand the fundamentals of the Ruby toolchain. But if I throw more CPU cores at the job, will it significantly reduce the amount of time that any one of these processes requires to complete? For example, will a bundle command or a rake command be able to divide its work between the available CPUs, or are the additional cores mainly useful for running multiple concurrent processes when multiple users are hitting the website?

1 Like

I’m going a bit off-topic, but when I was working on a forum migration with the same number of posts, I modified the import script to import only 1/100 or 1/1000 of the topics and posts.

It’s a faster way to see if your import is reliable and if tweaks or debugging are needed.

3 Likes

@Canapin Actually, thanks a lot for mentioning that; I would really like to know how you did it. I’ve been wanting to do the same thing, but I threw out the idea because I assumed I would run into database inconsistencies with a partial import. So I ended up creating a skeleton Drupal test forum to test on. But I’d prefer to test on a copy of the production DB.

I’m mainly concerned about the eventual final production migration; I’ll have to take the old forum offline or at least make it read-only, and it’s looking like roughly 48h of downtime at best, unless throwing double the CPU cores at it would cut the time in half?

1 Like

The tasks that take a long time to restore are indeed multi-threaded. One caveat: 2x the CPUs almost never means 2x the performance.

Another point is that rake posts:rebake and the rest of the heavy lifting the forum does to recover and optimize the content can usually happen with the forum live. That can reduce the time you need to keep the forum offline or read-only, at the cost of offering a somewhat degraded experience in the meantime.
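
For instance, rebaking can be driven in slices from rails c inside the container, so the most visible content gets re-cooked first. This is only a sketch (the category lookup is just an example; the stock rake posts:rebake task covers the same ground for all posts):

  # Rebake one category at a time from `rails c`, so the most visible
  # content is fixed first while the site stays up.
  # "General" is a placeholder; use your own category names.
  category = Category.find_by(name: "General")
  Topic.where(category_id: category.id).find_each do |topic|
    topic.posts.find_each do |post|
      post.rebake!  # re-renders the raw source into the cached cooked HTML
    end
  end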

My recommendation would be: first, test. Do the migration, see how much time it takes and how the forum looks without all the rebakes in place. If it’s good enough, time the migration to end around your forum’s lower-traffic hours; that way you gain some 4-10h of migration time without a lot of people complaining.

3 Likes

Excellent, thanks for confirming this, I was wondering about this option as well.

Unfortunately, I didn’t write that down and I’ve forgotten the details… But if you know how to code, it shouldn’t be very difficult.
I might have tweaked the BATCH_SIZE and offset values, among other things, to alter the loop and make it skip batches of posts, or something like that…

I can’t retry it now because I don’t have a forum to import at the moment, but I’ll write a quick tutorial next time, because I think it’s quite useful.
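
From memory, the change was something along these lines. Treat it strictly as a sketch: the table name, columns, and helper calls below are placeholders in the style of the standard import scripts, not the exact code I used.

  # Sketch: import roughly 1 batch in every 100 instead of everything.
  # Assumes the importer walks the source table with the usual
  # batches(BATCH_SIZE) do |offset| loop and a LIMIT/OFFSET query.
  SAMPLE_EVERY = 100

  batches(BATCH_SIZE) do |offset|
    # keep only every 100th batch, skip the rest
    next unless ((offset / BATCH_SIZE) % SAMPLE_EVERY).zero?

    rows = mysql_query(
      "SELECT id, title, user_id, created_at
         FROM source_topics          -- placeholder table/column names
        ORDER BY id
        LIMIT #{BATCH_SIZE}
       OFFSET #{offset}"
    )
    break if rows.size < 1

    # total_count is whatever count the script already computed earlier
    create_posts(rows, total: total_count, offset: offset) do |row|
      # ... same field mapping as in the full import ...
    end
  end

Replies whose topic didn’t make it into the sample should simply be skipped by the importer, so you end up with a sparse but internally consistent forum to poke at.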

1 Like

I want to mention two things.

  • Yes, CPUs matter, so get a bigger VPS and run multiple Sidekiq instances for rebaking and image processing; it will go faster (there’s a quick way to check the queues below).
  • When your import is completely finished, it’s always a good idea to do a backup / restore; it will give you better database performance.

Putting the two together: get a big VPS for the import, and when you’re done, move to a smaller production VPS (using backup and restore).
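
To see whether those extra Sidekiq workers are actually keeping up during the rebaking and image processing, a quick snapshot from rails c works; this is just a sketch using the standard Sidekiq API:

  # Snapshot of the Sidekiq backlog from `rails c`.
  require "sidekiq/api"

  stats = Sidekiq::Stats.new
  puts "enqueued: #{stats.enqueued}"            # jobs waiting across all queues
  puts "busy:     #{Sidekiq::Workers.new.size}" # jobs being worked right now
  stats.queues.each { |name, size| puts "#{name}: #{size}" }  # per-queue backlog

The /sidekiq dashboard in the admin interface shows the same numbers if you prefer a UI.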

Generally an import will not require you to rebake posts afterwards.

3 Likes

Thanks a lot for the reply, Richard. So which one(s) of these?

  • UNICORN_WORKERS
  • UNICORN_SIDEKIQS
  • DISCOURSE_SIDEKIQ_WORKERS

Interesting, haven’t seen this recommendation before. Does that reduce fragmentation or something?

Yeah, I was initially going to try to fix some [QUOTE] problems and the Textile → Markdown conversion with regexp_replace() in the Postgres console and then rebake all posts, because the rake posts:remap commands were just too slow. But then I discovered that the regexp flavor Postgres uses is not PCRE-compatible, and there are just too many unexpected anomalies to rely on it. So I’m going to try to run the posts through Pandoc during the import process, which should let me get the imported site up and running in a presentable state, and then fix smaller stuff like emoji keywords with rake posts:remap.
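
For the record, the rough idea is to shell out to Pandoc wherever the importer maps the raw post body. This is only a sketch: the helper name and the exact pandoc flags are my own guesses at this point, and spawning one pandoc process per post will add noticeable overhead at ~2M posts, so I may end up batching or caching the conversions.

  require "open3"

  # Convert a Textile body to Markdown via pandoc (which must be installed
  # in the import container). Falls back to the original text on failure.
  def textile_to_markdown(textile)
    markdown, status = Open3.capture2(
      "pandoc", "--from=textile", "--to=gfm", "--wrap=none",
      stdin_data: textile
    )
    status.success? ? markdown : textile
  end

  # ...then, inside the importer's post mapping:
  #   raw = textile_to_markdown(row["body"])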

  • UNICORN_SIDEKIQS → number of processes (default 1)
  • DISCOURSE_SIDEKIQ_WORKERS → number of threads inside a process (default 5)

It reduces fragmentation, and it fixes the Postgres statistics, which can get skewed by the import.

2 Likes

:+1:

I haven’t seen this advice before either. If this is “always a good idea”, maybe it should be added to Pre-launch checklist after migrating from another platform?

I think it was Sam or Jeff who gave me this advice many years ago. I can’t find it anymore. Maybe we should check if it’s still a good idea and/or worth the effort :wink:

1 Like

By any chance, could anybody share tips on the fastest way to re-run an import script and make it re-import the data? I’m trying to tweak some text substitution in the importer script, and when I don’t get it right I have to delete the Discourse database and ./launcher rebuild import, which takes quite a while. I’d like to be able to make changes in my importer script and have it start over from the beginning (I’m using a small skeleton mockup database of my site right now, so the importer itself runs very fast).

Hmmm. I’m testing another import of my production forum data, this time on a fairly powerful VPS with 8 virtual cores and 16GB of RAM. I set:
UNICORN_SIDEKIQS=4
DISCOURSE_SIDEKIQ_WORKERS=20
UNICORN_WORKERS=16

With this, it doesn’t seem to be taking advantage of all the cores during the import_topics stage.

It’s interesting, though, that the CPU graph was pegged at over 600% (so roughly 6 of the 8 cores at 100%) during the user_import stage.

I also noticed this env variable: RUBY_GLOBAL_METHOD_CACHE_SIZE=131072. Would that be too small?

I think that during the user creation stage there are more actions that are handled asynchronously by Sidekiq.
A large part of the import will unfortunately not benefit from parallelization; you should optimize for single-core CPU speed instead.

Theoretically you could run different chunks of the topics import in parallel, but it would require quite a bit of refactoring of the importer and making sure everything is processed in order. Not worth it for a one-off task with a few iterations.

2 Likes

I followed a combination of these two[1] guides[2] for importing with access to another Docker container running a copy of the source forum database in MySQL. But it dawned on me that instead of creating a separate import container, I can just use a single app container and add the mysql-dep.template to it:

templates:
  - "templates/postgres.template.yml"
  - "templates/redis.template.yml"
  - "templates/web.template.yml"
  - "templates/web.ratelimited.template.yml"
  - "templates/web.ssl.template.yml"
  - "templates/web.letsencrypt.ssl.template.yml"
  - "templates/import/mysql-dep.template.yml"

This lets me have a functioning Discourse instance while the importer script is running. Is there any disadvantage to opening up the forum to the public as soon as all the users and categories are imported, and just letting the users know with a banner that it will be a few days until it’s fully populated? I’m thinking that at the very least I could open it up after all the topics and posts are imported but before the private messages are imported, as the private messages alone will take a good 24h to import.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.