Migrating a large forum


(zh99998) #1

I’m migrating a large forum that has been running for many years from Discuz to Discourse.
There are about 100k users, 3M posts and 150GB of attachments.

I’m using the discuz_x import script to do this now, but it’s very slow and may take many days to finish.

Is there any advice for writing a faster import script?


(Robin Ward) #2

We have some ideas for the future, but right now the unfortunate truth is you have to just wait as long as it takes. We have a very fast server set up to perform the imports we do at Discourse.


(Dean Taylor) #3

This post I made in the phpBB 3 Importer (old) discussion might be of interest to you:

Yes, this was an older version of the import script, but it might still be relevant.

It turns out the import was more like just over 750K posts and 150K topics including private messages.

Sidekiq executes the background tasks created by each imported post, so it’s always worth a look at /sidekiq to see whether you have a backlog. That’s the reason I mention “at least 25 sidekiq’s running”: to keep the backlog of those tasks down.

My import ran for well over 36 hours on the machine I ended up using, which had 64GB of memory and a 20-core processor.


#4

What’s more important to the speed of import - RAM or CPU?


(Jay Pfaffman) #5

CPU and SSD speed are very important. You need sufficient RAM, but it’s likely not important to have more than 8GB.


#6

And the more cores I throw at it the better? Or is there a limit beyond which extra cores stop helping, assuming a fast SSD? Does the import function leverage hyper-threading? Thank you.


(Jay Pfaffman) #7

Not that much. Maybe one for the database and one for the importer (and a third for redis?)

In a recent non-scientific test, I ran the same import on a couple of machines. One was an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz with 16GB RAM. The other was an Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz with 32GB. I’m pretty sure both have the same Samsung 850EVO (oh! but the first machine has 1GB). The i7 won. I think that it’s mostly because I had the SSD on a slow SATA port, but I haven’t tested again.


(Jeff Atwood) #8

No, you’re 100% right, fast CPU (single thread speed, not “m0ar cores”) and fast disk are the main things.


#9

Thanks for the great insights. Just trying to make a judgement here before diving in. Downtime is a consideration, and we’re talking ~10 million posts here. So, if the import is limited to a single thread (is Sidekiq multithreaded? Might there be some way to reconfigure the script to hive off separate threads?), we’re likely talking days here, even with the latest, greatest CPU and reading from one SSD and writing to another?


#10

Why don’t you perform a dress rehearsal or two? Optimise that and then you also get more predictability.
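
As a rough illustration of what a rehearsal buys you, here is a minimal Python sketch that extrapolates total runtime from a measured sample. It assumes throughput stays constant as the database grows, which is optimistic, and it borrows the figures quoted earlier in this thread (roughly 750K posts in 36 hours, on different hardware and an older importer) purely as example inputs. The function name is made up for this example.

```python
def estimate_hours(sample_posts, sample_hours, total_posts):
    """Linearly extrapolate import runtime from a rehearsal run.

    Assumes posts/hour stays constant for the whole import,
    which is an optimistic simplification.
    """
    if sample_posts <= 0:
        raise ValueError("sample_posts must be positive")
    return total_posts * sample_hours / sample_posts


# Example inputs only: 750K posts in 36 hours, scaled to 10M posts.
hours = estimate_hours(sample_posts=750_000, sample_hours=36, total_posts=10_000_000)
print(f"~{hours:.0f} hours, ~{hours / 24:.0f} days")
```

With those inputs the estimate comes to about 480 hours (20 days). Treat it as a floor rather than a prediction, since the quoted run took "well over" 36 hours; numbers from your own rehearsal on your own hardware will be far more useful.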


(Jay Pfaffman) #11

The script can be rerun to import data that is new since the previous run, so you don’t need to take the forum down until you do the final import. Scripts that I’ve touched allow you to set an IMPORT_AFTER environment variable that omits the earlier data from the import altogether (otherwise the script still reads that data but skips it).

A couple of importers have a bulk importer, which speeds things up, but it’s more complex to run.
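
To make the difference between "omit" and "read but skip" concrete, here is a small Python sketch of the idea. It is not the importer's actual code (the import scripts are Ruby); the `posts` table, the `created_at` column, the ISO date format, and the `fetch_posts` helper are all assumptions made for this example. The point is only that a cutoff applied in the query means old rows are never read at all.

```python
import os
import sqlite3


def fetch_posts(conn, import_after=None):
    """Return (id, created_at) rows, optionally only those newer than a cutoff.

    The cutoff is applied in the SQL query, so older rows are never fetched.
    """
    sql = "SELECT id, created_at FROM posts"
    params = ()
    if import_after:
        sql += " WHERE created_at > ?"
        params = (import_after,)
    return conn.execute(sql + " ORDER BY id", params).fetchall()


# Hypothetical source data standing in for the old forum's database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?)",
    [(1, "2016-05-01"), (2, "2017-01-15"), (3, "2017-03-02")],
)

# Assumed format: an ISO date such as 2017-01-01; unset means import everything.
cutoff = os.environ.get("IMPORT_AFTER")
rows = fetch_posts(conn, cutoff)
print(f"{len(rows)} posts selected for import")
```

Running it with `IMPORT_AFTER=2017-01-01` set in the environment would select only the two 2017 rows; with the variable unset, all three rows are read.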

You’re likely looking at a week or two of runtime alone for the first run.