We have a 26 GB data dump (a Vanilla forum export) that we are importing into Discourse:
1.3 million users
3 million topics
21 million posts
Our problem is that the import starts at around 500k/min, but after a few minutes it drops to about 2k/min.
You’ll need lots of RAM. You might look at the bulk importers, but I don’t believe that there is one for Vanilla.
Hi Jay. We are using a c5.4xlarge instance on AWS. At first the import runs at 500k/min, then it slows down after a few minutes.
The import script is restartable, but this is unfortunately just normal with the import scripts.
Yup, when I restart it, it skips the data that was already imported, but the rate still decreases over time.
Thanks for confirming this. A total of 31 million records will take a month or so if the rate keeps decreasing. Any suggestions for improving this, or is it just the way it is?
You need a CPU with fast single-core speed, which is quite hard to find in the cloud.
Or give the bulk import script a try: Importers for large forums
There is one for Vanilla: https://github.com/discourse/discourse/blob/master/script/bulk_import/vanilla.rb
We use a c5.4xlarge from AWS:
vCPU: 16, Memory (GiB): 32
Is this enough, or should we upgrade?
Sure, will try that bulk import script. Thanks!
You will need a CPU from the top of the PassMark "CPU Benchmarks - Single Thread Performance" list if you want to run the regular import script as fast as possible. I have no idea what you get on AWS or any other cloud provider with vCPUs.
You want to use the bulk importer.
Whenever I try the bulk import, it stops; the traceback points at the category IDs.
I tried changing the -1 to 0:
@last_imported_category_id = imported_category_ids.max || -1
to
@last_imported_category_id = imported_category_ids.max || 0
I even tried deleting the category with the -1 ID and then running it again. No luck.
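For context, that || -1 is only a fallback for when nothing has been imported yet, since Ruby's Array#max returns nil on an empty array. A quick illustration in plain Ruby (the array contents below are made up, not from our data):

imported_category_ids = []            # nothing imported yet
imported_category_ids.max             # => nil (max of an empty array)
imported_category_ids.max || -1       # => -1, the "nothing imported yet" fallback

imported_category_ids = [3, 7, 12]    # some categories already imported
imported_category_ids.max || -1       # => 12, the fallback is ignored

So changing the -1 to 0 only changes the fallback value used when no categories have been imported yet.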
If you can hire extra help, contact @pfaffman at https://www.literatecomputing.com/.