Importers for large forums

mtawil · July 24, 2017, 11:16am

Yeah, I edited it, and it works.
But now I face a new issue when importing users:

Importing users...
script/bulk_import/vbulletin.rb:363:in `strptime': invalid date (ArgumentError)
	from script/bulk_import/vbulletin.rb:363:in `parse_birthday'
	from script/bulk_import/vbulletin.rb:78:in `block in import_users'
	from /var/www/discourse/script/bulk_import/base.rb:438:in `block (2 levels) in create_records'
	from /usr/local/lib/ruby/gems/2.4.0/gems/rack-mini-profiler-0.10.5/lib/patches/db/mysql2.rb:6:in `each'
	from /usr/local/lib/ruby/gems/2.4.0/gems/rack-mini-profiler-0.10.5/lib/patches/db/mysql2.rb:6:in `each'
	from /var/www/discourse/script/bulk_import/base.rb:437:in `block in create_records'
	from /usr/local/lib/ruby/gems/2.4.0/gems/pg-0.20.0/lib/pg/connection.rb:160:in `copy_data'
	from /var/www/discourse/script/bulk_import/base.rb:436:in `create_records'
	from /var/www/discourse/script/bulk_import/base.rb:178:in `create_users'
	from script/bulk_import/vbulletin.rb:72:in `import_users'
	from script/bulk_import/vbulletin.rb:25:in `execute'
	from /var/www/discourse/script/bulk_import/base.rb:33:in `run'
	from script/bulk_import/vbulletin.rb:377:in `<main>'

zogstrip · July 24, 2017, 11:18am

Can you share some examples of birthday dates? Maybe the format isn’t the same.

mtawil · July 24, 2017, 11:23am

The incorrect date is (02-29-1989)
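For reference, 1989 was not a leap year, so February 29 does not exist in it and `Date.strptime` rejects the value. A minimal sketch that reproduces this outside the importer; `strptime_error` is a hypothetical helper used only for illustration:

```ruby
require "date"

# Hypothetical helper: returns the error message when strptime rejects
# the value, or nil when the value parses as a real calendar date.
def strptime_error(value, format = "%m-%d-%Y")
  Date.strptime(value, format)
  nil
rescue ArgumentError => e
  e.message
end

p strptime_error("02-29-1989")     # an error message: 1989 has no Feb 29
p strptime_error("02-29-1988")     # nil: 1988 is a leap year
p Date.valid_date?(1989, 2, 29)    # false
```

`Date.valid_date?` gives the same answer without raising, which can be handy when the parts of the date are already split out.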

mtawil · July 24, 2017, 11:33am

I know it’s a wrong date, but that’s what I get from users; I can’t check and fix every date manually.
The easiest way would be to make the parse_birthday function check whether the date is valid.
I wish I could develop in Ruby; I’m trying to learn it.

zogstrip · July 24, 2017, 11:36am

What if you replace the method with this?

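  def parse_birthday(birthday)
    return if birthday.blank?
    # Try the plain format first, then the parenthesized one; an invalid
    # date such as 02-29-1989 raises and is rescued to nil.
    date_of_birth   = Date.strptime(birthday, "%m-%d-%Y") rescue nil
    date_of_birth ||= Date.strptime(birthday, "(%m-%d-%Y)") rescue nil
    return if date_of_birth.nil?
    # Move years before 1904 up to 1904, keeping the month and day.
    date_of_birth.year < 1904 ? Date.new(1904, date_of_birth.month, date_of_birth.day) : date_of_birth
  end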

mtawil · July 24, 2017, 11:59am

OK, it works and skips the wrong date now.

New issue:
After importing more than 200K users (I stopped it with CTRL+C), I went to check the imported users, and sadly no data has been inserted into the ‘discourse_development.users’ table.
Does the importer have to finish before the users’ data shows up?

zogstrip · July 24, 2017, 12:26pm

I pushed the fixes so that I don’t forget.

Yup. The bulk importer works in “batches”. And by “batch” I mean “table by table”.
For speed, everything is kept in memory before it’s sent to the database.

zogstrip · July 24, 2017, 12:57pm

Also, I just merged a PR fixing the bulk importer with regard to our extraction of emails into another table.

@mtawil you should totally update before re-running the import.

quangbuule · July 24, 2017, 1:05pm

Thanks @zogstrip!
There are some problems with charsets: some forums date back to the early vBulletin era, when the charset was latin1, so the text is really messy after importing.

I will open a PR for that.
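As a rough sketch of the kind of fix involved: assuming the stored bytes are really Windows-1252 (the encoding MySQL calls latin1), they can be re-tagged and transcoded to UTF-8. `latin1_to_utf8` is a hypothetical helper name, not something from the importer.

```ruby
# Hypothetical helper: treat the raw bytes as Windows-1252 and transcode
# them to UTF-8. Bytes with no mapping are replaced with "?" instead of
# raising.
def latin1_to_utf8(raw)
  raw.dup
     .force_encoding(Encoding::Windows_1252)
     .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
end

raw = "caf\xE9".b            # the bytes of "café" as stored in a latin1 column
fixed = latin1_to_utf8(raw)
puts fixed
puts fixed.valid_encoding?
```

This only helps when the assumption about the source encoding holds; text that was already UTF-8 would be mangled by the same call.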

mtawil · July 24, 2017, 4:41pm

Thank you! I will update it, of course.

BTW, the import rate keeps slowing down over time. To see what I mean, please check out this video:

That’s just ~1.5M users. Imagine more than 80M posts; the rate might drop to 1/sec.

pfaffman · July 24, 2017, 5:02pm

How much RAM do you have? You’ll need enough RAM to hold the whole table. My guess is that you’re starting to swap and that’s slowing you down. Is swap on a hard drive or an SSD?

mtawil · July 24, 2017, 5:56pm

Well, is a 16-core Intel Xeon CPU @ 2.30GHz with 64GB of RAM not enough?

pfaffman · July 24, 2017, 5:58pm

Oh. Darn. So much for that explanation.

mtawil · July 25, 2017, 11:04am

Can it be “row by row” instead of “table by table”? And can it start from where it ended (after the last imported ID)?
This would be very helpful for large forums.

zogstrip · July 25, 2017, 11:09am

Then it wouldn’t be a bulk importer, would it?

mtawil · July 25, 2017, 11:14am

Well, can we call it a chunk importer?

pfaffman · July 25, 2017, 11:52am

That’s what the regular importer does. Perhaps that’s what you want.

mtawil · July 25, 2017, 11:58am

What I mean is that when I stop the bulk importer and rerun it (because of the speed issue), it should resume at the last row, not start from the first row.

quangbuule · July 25, 2017, 2:59pm

Hi guys, is there any need for mapping old forums (categories) onto new, configurable categories and tags? We could merge or split old categories into a new structure.

My idea is to have a .yml file that lists the new categories and the ids of the old categories that go into each, plus tags somehow.
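To make the idea concrete, here is a minimal sketch of what such a file and its lookup might look like. The layout, key names (`categories`, `old_ids`, `tags`), category names and ids are all made up for illustration, and `build_category_lookup` is a hypothetical helper.

```ruby
require "yaml"

# Hypothetical mapping: each new category lists the old forum ids merged
# into it, plus the tags to apply. All names and ids are invented.
MAPPING_YAML = <<~YAML
  categories:
    - name: Support
      old_ids: [3, 7, 12]
      tags: [help]
    - name: Announcements
      old_ids: [1]
      tags: []
YAML

# Build a lookup from old forum id to its new category name and tags.
def build_category_lookup(yaml_text)
  config = YAML.safe_load(yaml_text)
  config.fetch("categories").each_with_object({}) do |category, lookup|
    category.fetch("old_ids").each do |old_id|
      lookup[old_id] = { name: category.fetch("name"), tags: category.fetch("tags", []) }
    end
  end
end

lookup = build_category_lookup(MAPPING_YAML)
p lookup[7]
```

Merging falls out of listing several old ids under one new category; splitting one old category would need extra rules that this sketch does not cover.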

mtawil · July 26, 2017, 12:44pm

New issue:

discourse@ip-10-0-1-178-app:/var/www/discourse$ IMPORT=1 RAILS_ENV=production ruby script/bulk_import/vbulletin.rb
Loading application...
Starting...
Preloading I18n...
Fixing highest post numbers...
Loading imported group ids...
Loading imported user ids...
Loading imported category ids...
Loading imported topic ids...
Loading imported post ids...
Loading groups indexes...
Loading users indexes...
Loading categories indexes...
Loading topics indexes...
Loading posts indexes...
Importing groups...
Importing users...
1270000 -    119/sec
        /var/www/discourse/script/bulk_import/base.rb:521:in `blank?': invalid byte sequence in UTF-8 (ArgumentError)
        from /var/www/discourse/script/bulk_import/base.rb:521:in `fix_name'
        from /var/www/discourse/script/bulk_import/base.rb:234:in `process_user'
        from /var/www/discourse/script/bulk_import/base.rb:486:in `block (2 levels) in create_records'
        from /usr/local/lib/ruby/gems/2.4.0/gems/rack-mini-profiler-0.10.5/lib/patches/db/mysql2.rb:6:in `each'
        from /usr/local/lib/ruby/gems/2.4.0/gems/rack-mini-profiler-0.10.5/lib/patches/db/mysql2.rb:6:in `each'
        from /var/www/discourse/script/bulk_import/base.rb:483:in `block in create_records'
        from /usr/local/lib/ruby/gems/2.4.0/gems/pg-0.20.0/lib/pg/connection.rb:160:in `copy_data'
        from /var/www/discourse/script/bulk_import/base.rb:482:in `create_records'
        from /var/www/discourse/script/bulk_import/base.rb:191:in `create_users'
        from script/bulk_import/vbulletin.rb:131:in `import_users'
        from script/bulk_import/vbulletin.rb:81:in `execute'
        from /var/www/discourse/script/bulk_import/base.rb:33:in `run'
        from script/bulk_import/vbulletin.rb:494:in `<main>'

My forum encoding is “UTF8mb4”
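The error suggests that at least one name is tagged as UTF-8 but contains bytes that are not valid UTF-8. A minimal sketch of one way to guard against that, using Ruby's `String#scrub`; `scrub_name` is a hypothetical helper, not part of the importer:

```ruby
# Hypothetical helper: drop bytes that are not valid UTF-8 so that later
# string checks do not raise. Valid strings and nil pass through unchanged.
def scrub_name(name)
  return name if name.nil? || name.valid_encoding?
  name.scrub("")
end

bad = "Jos\xE9".dup.force_encoding(Encoding::UTF_8) # stray single-byte character
puts bad.valid_encoding?
puts scrub_name(bad)
```

Scrubbing throws the offending bytes away; if the source column really holds another encoding, transcoding it would preserve more of the original text.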
