Migrating a large forum

zh99998 · January 12, 2016, 3:59pm

I’m migrating a large, many-years-old forum from Discuz to Discourse.
There are about 100k users, 3M posts, and 150GB of attachments.

I’m using the discuz_x import script to do this now, but it’s very slow and may take many days to finish.

Is there any advice for writing a faster import script?

eviltrout · January 12, 2016, 5:15pm

We have some ideas for the future, but right now the unfortunate truth is you have to just wait as long as it takes. We have a very fast server set up to perform the imports we do at Discourse.

DeanMarkTaylor · January 12, 2016, 5:31pm

This post I made in the discussion from phpBB 3 Importer (old) might be of interest to you:

Yes, this was an older version of the import script, but it might still be relevant.

It turns out the import was more like just over 750K posts and 150K topics including private messages.

The Sidekiq workers execute background tasks created by each imported post, and it’s always worth a quick look in /sidekiq to see if you have a backlog. That’s the reason I mention “at least 25 sidekiq’s running”: to keep the number of pending tasks down.

My import ran for well over 36 hours on the machine I ended up using, which had 64GB of memory and a 20-core processor.

ronan0 · December 5, 2018, 1:08pm

What’s more important to the speed of the import: RAM or CPU?

pfaffman · December 5, 2018, 1:11pm

CPU and SSD speed are very important. You need sufficient RAM, but it’s likely not important to have more than 8GB.

ronan0 · December 5, 2018, 3:55pm

And the more cores I throw at it the better? Or is there a limit beyond which extra cores stop returning any benefit, assuming a fast SSD? Does the import function leverage hyper-threading? Thank you.

pfaffman · December 7, 2018, 12:09am

Not that much. Maybe one for the database and one for the importer (and a third for Redis?)

In a recent non-scientific test, I ran the same import on a couple of machines. One was an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz with 16GB RAM. The other was an Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz with 32GB. I’m pretty sure both have the same Samsung 850 EVO (oh! but the first machine has 1GB). The i7 won. I think that’s mostly because I had the SSD on a slow SATA port, but I haven’t tested again.

codinghorror · December 7, 2018, 12:15am

No, you’re 100% right, fast CPU (single thread speed, not “m0ar cores”) and fast disk is the main thing.
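One way to see why extra cores help so little is Amdahl's law, which caps the speedup by the fraction of work that can run in parallel. The sketch below is only an illustration: the 0.3 parallel fraction is an assumed number, not something measured on a Discourse import, and `amdahl_speedup` is a hypothetical helper name.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Theoretical speedup from Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # ASSUMPTION: 30% of the import can run in parallel (database, Redis,
    # background jobs); the rest is a single-threaded importer loop.
    assumed_parallel_fraction = 0.3
    for cores in (1, 2, 4, 8, 16):
        speedup = amdahl_speedup(assumed_parallel_fraction, cores)
        print(f"{cores:>2} cores -> {speedup:.2f}x")
```

With a mostly serial workload the curve flattens after two or three cores, which matches the advice above to favour single-thread speed.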

ronan0 · December 7, 2018, 9:27am

Thanks for the great insights. Just trying to make a judgement here before diving in. Downtime is a consideration, and we’re talking ~10 million posts here. So, if we’re limited to a single thread (is Sidekiq multithreaded? might there be some way to reconfigure the script to hive off separate threads?), we’re likely talking days here, even with the latest, greatest CPU and reading from one SSD and writing to another?

merefield · December 7, 2018, 9:37am

Why don’t you perform a dress rehearsal or two? Optimise that and then you also get more predictability.
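A rehearsal also gives you a number to extrapolate from. Here is a minimal sketch, assuming throughput stays constant over the whole run (a real import may not behave that way). The function name is hypothetical, and the sample figures are simply the 750K posts in 36 hours mentioned earlier in this thread, applied to a 10 million post forum.

```python
def estimate_full_import_hours(sample_posts, sample_hours, total_posts):
    """Linearly extrapolate total runtime from a timed rehearsal run."""
    if sample_posts <= 0 or sample_hours <= 0:
        raise ValueError("sample_posts and sample_hours must be positive")
    posts_per_hour = sample_posts / sample_hours
    return total_posts / posts_per_hour


if __name__ == "__main__":
    # Illustrative inputs only: a different importer or machine will
    # give a different rate, so time your own rehearsal.
    hours = estimate_full_import_hours(750_000, 36, 10_000_000)
    print(f"{hours:.0f} hours, about {hours / 24:.0f} days")
```

At that rate the full run works out to roughly 480 hours, or about 20 days, so it is worth measuring your own rate before scheduling anything.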

pfaffman · December 7, 2018, 1:34pm

The script can be rerun to import data that is new since the previous run, so you don’t need to take the forum down until you do the final import. Scripts that I’ve touched allow you to set an IMPORT_AFTER environment variable that omits older data from the import entirely (otherwise the script still reads that data and then skips it).

A couple of importers have a bulk importer, which speeds things up but is more complex to run.
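To illustrate the idea behind a cutoff like IMPORT_AFTER, here is a rough sketch in Python. The real import scripts are Ruby and this is not their code: the function names, the `posts` table, its columns, and the `YYYY-MM-DD` date format are all assumptions made for the example. The point is that filtering in the query means old rows are never read at all.

```python
import os
from datetime import datetime


def parse_import_after(env=os.environ):
    """Return the cutoff as a datetime, or None when IMPORT_AFTER is unset."""
    raw = env.get("IMPORT_AFTER")
    # ASSUMPTION: the date is given as YYYY-MM-DD.
    return datetime.strptime(raw, "%Y-%m-%d") if raw else None


def build_posts_query(cutoff):
    """Build a parameterised query; table and column names are hypothetical."""
    sql = "SELECT id, created_at, body FROM posts"
    params = []
    if cutoff is not None:
        # Push the filter into the database so old rows are never fetched.
        sql += " WHERE created_at > %s"
        params.append(cutoff)
    return sql + " ORDER BY id", params


if __name__ == "__main__":
    sql, params = build_posts_query(parse_import_after())
    print(sql, params)
```

Running it with the variable set, for example `IMPORT_AFTER=2018-12-01`, adds the `WHERE` clause; without it the query covers every post.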

You’re likely looking at a week or two of runtime alone for the first run.
