Vanilla to Discourse Large Data Import (decreasing speed)

We have a 26 GB data dump from Vanilla to import into Discourse:
1.3 million users
3 million topics
21 million posts

Our problem is that the import starts at around 500k records/min, but after a few minutes it drops to about 2k/min.

1 Like

You’ll need lots of RAM. You might look at the bulk importers, but I don’t believe that there is one for Vanilla.

2 Likes

Hi Jay. We are using a c5.4xlarge instance on AWS; at first the import runs at 500k/min, then it slows down after a few minutes.

The import script is restartable, but this is unfortunately just normal with the import scripts.
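
For context, the reason restarting works is that the import scripts remember the source id of every record they have already created and skip those on the next run. A minimal sketch of that pattern in Ruby, assuming the usual import_id custom-field bookkeeping; source_posts and create_post_from are hypothetical placeholders, not the actual ImportScripts API:

    require "set"

    # Source ids of posts that were created by a previous run.
    already_imported = PostCustomField
      .where(name: "import_id")
      .pluck(:value)
      .to_set

    source_posts.each do |row|
      # Skip anything a previous run already brought over.
      next if already_imported.include?(row[:source_id].to_s)

      post = create_post_from(row)                       # hypothetical helper
      post.custom_fields["import_id"] = row[:source_id]  # remember the mapping
      post.save_custom_fields
    end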

2 Likes

Yup, when I restart it, it just skips the data that was already imported, but the import speed still decreases over time :frowning:

1 Like

Thanks for confirming this. :frowning: A total of 31 million records will take a month or so if the rate keeps decreasing. Any suggestions on how to improve this, or is it just the way it is?
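
For a rough sense of scale, a quick back-of-envelope calculation in plain Ruby, using the 31 million figure from above:

    total_records = 31_000_000

    # Days needed at a few sustained import rates (records per minute).
    [500_000, 50_000, 2_000].each do |per_minute|
      days = total_records / per_minute.to_f / 60 / 24
      puts format("%8d records/min -> %5.1f days", per_minute, days)
    end

At a sustained 2k/min this comes out to roughly 11 days; if the rate keeps dropping below that, a month is plausible.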

You need a CPU with fast single-core speed, which is quite hard to find in the cloud.

Or give the bulk import script a try: Importers for large forums

There is one for Vanilla: https://github.com/discourse/discourse/blob/master/script/bulk_import/vanilla.rb

3 Likes

We use a c5.4xlarge from AWS (16 vCPU, 32 GiB memory).
Is this enough, or should we upgrade?

Sure, will try that bulk import script. Thanks!

You will need a CPU from the top of PassMark CPU Benchmarks - Single Thread Performance if you want to run the regular import script as fast as possible. I have no idea what you get on AWS or any other cloud provider with vCPUs. :man_shrugging:
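
If you want a very crude way to compare instances, a single-threaded Ruby loop timed with the standard Benchmark module gives a relative indicator (it is not comparable to PassMark scores, just useful for comparing two machines against each other):

    require "benchmark"

    # Time a tight single-threaded loop; smaller is better.
    elapsed = Benchmark.realtime do
      x = 0
      20_000_000.times { |i| x += i * i }
    end
    puts format("single-thread loop: %.2f s", elapsed)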

3 Likes

You want to use the bulk importer.

2 Likes

Whenever I try the bulk import, it stops there, since the traceback stops at category IDs.
I tried changing the -1 to 0:

@last_imported_category_id = imported_category_ids.max || -1
to
@last_imported_category_id = imported_category_ids.max || 0

I even tried deleting the category with id -1 and then tried again. No luck.
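
One way to see what the importer thinks is already imported: in the Discourse rails console, list the categories that carry an import_id custom field. This assumes the bulk importer tracks imported categories the same way the standard import scripts do (via category_custom_fields), which is worth verifying against vanilla.rb:

    # Run in `rails c` on the Discourse instance.
    imported = CategoryCustomField
      .where(name: "import_id")
      .pluck(:category_id, :value)

    puts "categories with an import_id: #{imported.size}"
    puts "highest imported source id:   #{imported.map { |_, v| v.to_i }.max.inspect}"

    # Seed categories created by Discourse itself have no import_id,
    # so they should not affect imported_category_ids.max.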

If you can hire extra help, contact @pfaffman at https://www.literatecomputing.com/.

3 Likes