Data import script improvements for migrating large data sets faster


(Junaid Mailk) #1

Hello Guys,

Please share your thoughts on the idea discussed below. Thank you.

We are migrating our forum from vBulletin to Discourse. We have 596400 users, 296000 topics and 5 million posts. We tried the provided script and imported the groups, categories, users and topics. The 596400 users (without avatar pictures) took around 29 hours, the topics took around 14 hours, and the posts took around 17 days. We ran the import script on a machine with the following specs:

RAM : 32 GB
CPU : Intel® Core™ i7-4770 CPU @ 3.40GHz with hyperthreading
Hard-Drive : 219 GB SSD
Cores : 4 physical and 8 logical cores

Ruby version : ruby 2.1.2p95
Rails : 4.2
Discourse branch : v1.5.3
OS : CentOS Linux release 7.2.1511 (Core)
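
For reference, the timings quoted above work out to the following import rates. This is only a quick calculation on the numbers from this post; the helper name `records_per_minute` is made up for illustration.

```ruby
# Rough throughput implied by the timings reported in this post.
# records_per_minute is a hypothetical helper, not part of the import script.
def records_per_minute(count, hours)
  count / (hours * 60.0)
end

users_rate  = records_per_minute(596_400, 29)        # users took ~29 hours
topics_rate = records_per_minute(296_000, 14)        # topics took ~14 hours
posts_rate  = records_per_minute(5_000_000, 17 * 24) # posts took ~17 days

puts format("users:  %.0f per minute", users_rate)
puts format("topics: %.0f per minute", topics_rate)
puts format("posts:  %.0f per minute", posts_rate)
```

That is roughly 343 users, 352 topics and 204 posts per minute on a single machine.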

Problem: The import process takes too long. If we go with this approach, our users will have to wait a very long time under the deployment model we have in mind. We imagined the following model for the move from vBulletin to Discourse.

“We will set vBulletin to read-only mode, so our users will still be able to read posts and use the Tapatalk mobile app, but they can’t create topics or reply to posts. Once the data has been imported successfully, the Discourse forum will be available on the same URLs and read-only mode will be turned off. Users can then use all available functionality again.”

Solution: We want to reduce the import time as much as we can so that our users are not disrupted for a long period. We thought we could reduce the time by importing data of the same type, where the records are independent of each other, in parallel on many machines. By independent data I mean that we import all users in step 1, all topics in step 2 and all posts in step 3. We implemented this idea using the same script provided in the Discourse repository, modified to accept start and end result-set parameters that define the data range, and using Sidekiq. Let's run through this flow with an example to make the idea clearer.

Let's assume the following:
Number of users = 20k
Number of machines running with the above idea implemented = 10
Background job processor = Sidekiq
Time to insert 200 users = 1 minute
Batch size processed by one Sidekiq job = 2k users

With the above assumptions, 10 Sidekiq jobs are created, each covering 2k users. The batch ranges are listed below. When Sidekiq processes the first job, it passes the start and end record-set parameters to the default vBulletin script, along with a parameter that restricts the import to ‘Users’ data with ids in the range 0-2000. Similarly, the next job processes 2000-4000, another job processes 10000-12000, and so on. With this approach we are able to insert 2k users per minute, using a database server with a high core count and optimized PostgreSQL settings. All of the data above is then processed in 10 minutes.
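
The 10-minute figure can be checked with a small calculation. This is a rough sketch, not part of the import script: `estimated_minutes` and its parameter names are hypothetical, and it assumes each machine runs one job at a time and that the per-job insert rate does not drop when jobs run in parallel.

```ruby
# Estimated wall-clock minutes when batches are spread over several machines.
# Hypothetical helper; assumes one job per machine at a time and a constant
# per-job insert rate (records per minute).
def estimated_minutes(total_records, batch_size, per_job_rate, machines)
  jobs = (total_records.to_f / batch_size).ceil   # number of Sidekiq jobs
  minutes_per_job = batch_size.to_f / per_job_rate # time for one batch
  rounds = (jobs.to_f / machines).ceil            # extra rounds if jobs > machines
  rounds * minutes_per_job
end

# Assumptions from the example: 20k users, 2k per job, 200 users/minute, 10 machines.
puts estimated_minutes(20_000, 2_000, 200, 10)
```

Under the same assumptions, the real user count of 596400 would need 299 jobs, which is 30 rounds on 10 machines, so about 300 minutes. Whether the rate really stays constant at that scale is something only a test run can show.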

Batch ranges (start inclusive, end exclusive: start <= id < end)
0 - 2000
2000 - 4000
4000 - 6000
6000 - 8000
8000 - 10000
...
16000 - 18000
18000 - 20000
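
The ranges above can be generated and handed out to workers along these lines. This is a minimal sketch under stated assumptions: plain Ruby threads stand in for Sidekiq workers on separate machines, `import_users` is a placeholder for the modified vBulletin script, and `batch_ranges` and `run_in_parallel` are hypothetical names. Whether the bounds are ids or row offsets depends on how the modified script selects its records.

```ruby
# Split the range 0...total into half-open [start, stop) batches.
def batch_ranges(total, batch_size)
  (0...total).step(batch_size).map do |start|
    [start, [start + batch_size, total].min]
  end
end

# Placeholder for the real work: the modified script would import the
# users whose position falls in start <= x < stop. Here it only reports
# how many records the range covers.
def import_users(start, stop)
  stop - start
end

# Hand the batches to a fixed number of workers through a shared queue.
# Threads are only a stand-in for Sidekiq jobs on separate machines.
def run_in_parallel(ranges, workers)
  queue = Queue.new
  ranges.each { |range| queue << range }
  workers.times { queue << nil } # one stop marker per worker

  threads = Array.new(workers) do
    Thread.new do
      count = 0
      while (range = queue.pop)
        count += import_users(*range)
      end
      count # records handled by this worker
    end
  end
  threads.map(&:value).reduce(0, :+)
end

ranges = batch_ranges(20_000, 2_000)
ranges.each { |start, stop| puts "#{start} - #{stop}" }
puts "records covered: #{run_in_parallel(ranges, 10)}"
```

Because each range is half-open, adjacent batches never overlap, so no user is imported twice even when all jobs run at the same time.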

I am still working on improving the number of inserts per minute so that large amounts of data can be processed even faster.

Regards,
Junaid


(Jeff Atwood) #2

@techapj can offer advice; we recently migrated a 4 million post vBulletin forum (dating from 2002) to Discourse. It was a large migration.

I definitely recommend running on native hardware in this situation, and you will need lots of memory for the database. Your specified machine covers that part fine.

How old is this forum, e.g. what is the oldest post?


(Junaid Mailk) #3

Hello Jeff,

Thank you for the quick reply. We set up our forum 13 years ago, in July 2003; our first post was on 2003-07-19.

Regards,
Junaid


(Arpit Jalan) #4

In my experience, import speed on my Hackintosh (i5, 16GB RAM, 256GB SSD) was much better than on a beefy remote server. I also increased BATCH_SIZE to 2000.

It took ~4 days to import a vBulletin forum with 4 million posts.