We have some ideas for the future, but right now the unfortunate truth is you have to just wait as long as it takes. We have a very fast server set up to perform the imports we do at Discourse.
This post I made in the phpBB 3 Importer (old) discussion might be of interest to you:
Yes, this was for an older version of the import script, but it might still be relevant.
It turns out the import was just over 750K posts and 150K topics, including private messages.
The Sidekiq processes execute the background tasks created by each imported post, and it’s always worth a quick look in /sidekiq to see if you have a backlog. That’s the reason I mention “at least 25 sidekiq’s running”: to keep the backlog of those tasks down.
My import ran for well over 36 hours on the machine I ended up using, which had 64GB of memory and a 20-core processor.
And the more cores I throw at it the better? Or is there a limit beyond which extra cores stop returning any benefit, assuming a fast SSD? Does the import function leverage hyper-threading? Thank you.
Not that much. Maybe one for the database and one for the importer (and a third for redis?)
In a recent non-scientific test, I ran the same import on a couple of machines. One was an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz with 16GB RAM. The other was an Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz with 32GB. I’m pretty sure both have the same Samsung 850EVO (oh! but the first machine has 1GB). The i7 won. I think that it’s mostly because I had the SSD on a slow SATA port, but I haven’t tested again.
Thanks for the great insights. Just trying to make a judgement here before diving in. Downtime is a consideration, and we’re talking ~10 million posts here. So, if we’re limited to a single thread (is Sidekiq multithreaded? Might there be some way to reconfigure the script to hive off separate threads?), we’re likely talking days here, even with the latest, greatest CPU and reading from one SSD and writing to another?
The script can be rerun to import data that is new since the previous run, so you don’t need to take the forum down until you do the final import. Scripts that I’ve touched allow you to set an IMPORT_AFTER environment variable that leaves older data out of the import (otherwise the script still reads that data but skips it).
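To illustrate the difference between omitting data and reading but skipping it, here is a minimal Python sketch. It is not the importer’s code (the import scripts are Ruby), and the function name, the sample rows and the `YYYY-MM-DD` date format are all assumptions made for the example.

```python
import os
from datetime import datetime

# Hypothetical source rows; a real importer would query the old forum's database.
SOURCE_POSTS = [
    {"id": 1, "created_at": datetime(2017, 1, 5)},
    {"id": 2, "created_at": datetime(2017, 6, 1)},
    {"id": 3, "created_at": datetime(2017, 9, 20)},
]


def posts_to_import(rows, already_imported_ids, environ=os.environ):
    """Return the rows that still need importing.

    If IMPORT_AFTER is set (assumed format: YYYY-MM-DD), older rows are
    dropped up front. Otherwise every row is read and the ones that were
    already imported are skipped one by one.
    """
    cutoff = environ.get("IMPORT_AFTER")
    if cutoff:
        cutoff_date = datetime.strptime(cutoff, "%Y-%m-%d")
        rows = [r for r in rows if r["created_at"] > cutoff_date]
    return [r for r in rows if r["id"] not in already_imported_ids]


# Both calls end up importing only post 3; the first never looks at posts 1 and 2.
with_cutoff = posts_to_import(SOURCE_POSTS, {1, 2}, {"IMPORT_AFTER": "2017-06-01"})
without_cutoff = posts_to_import(SOURCE_POSTS, {1, 2}, {})
```

The result is the same either way; the cutoff just saves the time spent reading rows that would be skipped anyway, which matters when the old forum has millions of posts.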
A couple of importers have a bulk version, which speeds things up but is more complex to run.
You’re likely looking at a week or two of runtime alone for the first run.
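As a rough sanity check, the figures quoted earlier in the thread (about 750K posts in a bit over 36 hours) can be extrapolated linearly. This is only a back-of-the-envelope sketch: it assumes the import rate stays constant, and the hardware, importer version and data will differ.

```python
# Figures quoted earlier in the thread. The real run took "well over" 36 hours,
# so this rate is on the optimistic side.
posts_imported = 750_000
hours_taken = 36

posts_per_hour = posts_imported / hours_taken


def estimated_days(total_posts, rate_per_hour=posts_per_hour):
    """Linear extrapolation: assumes the import rate stays constant."""
    return total_posts / rate_per_hour / 24


print(f"{posts_per_hour:,.0f} posts/hour")
print(f"{estimated_days(10_000_000):.1f} days for 10 million posts")
```

At that rate, 10 million posts works out to roughly 20 days, so plan for weeks rather than days for the first run.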