Importers for large forums


#21

Yeah, I edited it, and it works.
But now I'm facing a new issue when importing users:

Importing users...
script/bulk_import/vbulletin.rb:363:in `strptime': invalid date (ArgumentError)
	from script/bulk_import/vbulletin.rb:363:in `parse_birthday'
	from script/bulk_import/vbulletin.rb:78:in `block in import_users'
	from /var/www/discourse/script/bulk_import/base.rb:438:in `block (2 levels) in create_records'
	from /usr/local/lib/ruby/gems/2.4.0/gems/rack-mini-profiler-0.10.5/lib/patches/db/mysql2.rb:6:in `each'
	from /usr/local/lib/ruby/gems/2.4.0/gems/rack-mini-profiler-0.10.5/lib/patches/db/mysql2.rb:6:in `each'
	from /var/www/discourse/script/bulk_import/base.rb:437:in `block in create_records'
	from /usr/local/lib/ruby/gems/2.4.0/gems/pg-0.20.0/lib/pg/connection.rb:160:in `copy_data'
	from /var/www/discourse/script/bulk_import/base.rb:436:in `create_records'
	from /var/www/discourse/script/bulk_import/base.rb:178:in `create_users'
	from script/bulk_import/vbulletin.rb:72:in `import_users'
	from script/bulk_import/vbulletin.rb:25:in `execute'
	from /var/www/discourse/script/bulk_import/base.rb:33:in `run'
	from script/bulk_import/vbulletin.rb:377:in `<main>'

(Régis Hanol) #22

Can you share some examples of birthday dates? Maybe the format isn’t the same.


#23

The incorrect date is (02-29-1989)


#24

I know it’s the wrong date, but that’s what I get from users; I can’t check all the dates manually and edit them.
The easiest way would be to make the parse_birthday function check whether the date is valid.
I wish I could develop in Ruby; I’m trying to learn it.


(Régis Hanol) #25

What if you replace the method with this?

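  def parse_birthday(birthday)
    return if birthday.blank?
    # Try both formats; an impossible date (e.g. 02-29-1989, not a leap year) raises and becomes nil.
    date_of_birth   = Date.strptime(birthday, "%m-%d-%Y") rescue nil
    date_of_birth ||= Date.strptime(birthday, "(%m-%d-%Y)") rescue nil
    return if date_of_birth.nil?
    # Years before 1904 are moved to 1904 (a leap year, so Feb 29 stays valid).
    date_of_birth.year < 1904 ? Date.new(1904, date_of_birth.month, date_of_birth.day) : date_of_birth
  end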
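
For anyone who wants to try the idea outside of Rails, here is a minimal sketch of the same check. It assumes birthdays arrive as "MM-DD-YYYY" strings, optionally wrapped in parentheses. `safe_birthday` is a hypothetical name, not part of the importer, and it uses `Date.valid_date?` instead of rescuing the exception.

```ruby
require "date"

# Hypothetical helper, not part of the bulk importer.
# Returns a Date, or nil when the string is empty, malformed, or not a real calendar date.
def safe_birthday(birthday)
  return nil if birthday.nil? || birthday.strip.empty?

  # Accept "02-28-1989" as well as "(02-28-1989)".
  match = birthday.strip.match(/\A\(?(\d{1,2})-(\d{1,2})-(\d{4})\)?\z/)
  return nil unless match

  month, day, year = match.captures.map(&:to_i)
  return nil unless Date.valid_date?(year, month, day)

  # Same clamp as above: years before 1904 are moved to 1904.
  year < 1904 ? Date.new(1904, month, day) : Date.new(year, month, day)
end

safe_birthday("(02-29-1989)")  # nil, because 1989 was not a leap year
safe_birthday("02-28-1989")    # the Date 1989-02-28
```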

#26

OK, it works and skips the wrong date now.

New issue :weary: :
After importing more than 200K users (I stopped it with CTRL+C), I went to check the imported users, and sadly no data had been inserted into the ‘discourse_development.users’ table.
Does the importer have to run to completion before the user data shows up?


(Régis Hanol) #27

I pushed the fixes so that I don’t forget.

Yup. The bulk importer works in “batches”. And by “batch” I mean “table by table”.
For :zap: speed, everything is kept in memory before it’s sent to the database.
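
As a toy model of why nothing shows up after an interrupted run: rows pile up in memory and only reach the table once the whole batch is done. This is not the importer's actual code; `TableBatch` and its `sink` are hypothetical stand-ins for the importer and the database table.

```ruby
# Toy model: rows are buffered in memory and written only when the table is done.
class TableBatch
  attr_reader :buffered

  def initialize(sink)
    @sink = sink       # stand-in for the database table (a plain Array here)
    @buffered = []
  end

  def add(row)
    @buffered << row   # nothing is written to the sink yet
  end

  def flush
    @sink.concat(@buffered)  # the whole table is sent in one go
    @buffered = []
  end
end

users_table = []
batch = TableBatch.new(users_table)
(1..5).each { |id| batch.add(id: id) }

rows_before_flush = users_table.size  # 0: stopping here would leave the table empty
batch.flush
rows_after_flush = users_table.size   # 5
```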


(Régis Hanol) #28

Also, I just merged a PR that fixes the bulk importer with regard to our extraction of emails into a separate table.

@mtawil you should totally update before re-running the import.


(Quang-Buu Le) #29

Thanks @zogstrip!
There are some problems with charsets: some forums date back to the very early vBulletin era, when the charset was latin1, so the text is really messy after importing.

I will open a PR for that.
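
A minimal sketch of the kind of conversion involved, assuming the raw bytes coming out of the old database really are latin1 (ISO-8859-1). `latin1_to_utf8` is a hypothetical helper, not something in the importer.

```ruby
# Hypothetical helper: reinterpret raw bytes as ISO-8859-1 and transcode them to UTF-8.
def latin1_to_utf8(raw)
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

raw = "caf\xE9".b  # "café" as stored in latin1: the é is the single byte 0xE9

raw.dup.force_encoding("UTF-8").valid_encoding?  # false, 0xE9 on its own is not valid UTF-8
latin1_to_utf8(raw).valid_encoding?              # true, the é is now the two bytes 0xC3 0xA9
```

MySQL's latin1 is closer to Windows-1252 than to strict ISO-8859-1, so `Encoding::Windows_1252` may turn out to be the better source encoding for real data.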


#30

Thank you! I will update, of course.

BTW, the import rate keeps slowing down over time. To see what I mean, please check this video out:

That’s just ~1.5M users; imagine more than 80M posts, the rate may drop to 1/sec :slight_smile:


(Jay Pfaffman) #31

How much RAM do you have? You’ll need enough RAM to hold the whole table. My guess is that you’re starting to swap and that’s slowing you down. Is swap on a hard drive or an SSD?


#32

Well, is a 16-core Intel Xeon CPU @ 2.30GHz with 64GB of RAM not enough?


(Jay Pfaffman) #33

Oh. Darn. So much for that explanation. :exploding_head:


#34

Can it be “row by row” instead of “table by table”? And can it start from where it ended (after the last imported ID)?
This would be very helpful for large forums.


(Régis Hanol) #35

Then it wouldn’t be a bulk importer, would it? :wink:


#36

Well, can we call it a chunk importer? :roll_eyes:


(Jay Pfaffman) #37

That’s what the regular importer does. Perhaps that’s what you want.


#38

What I mean is that when I stop the bulk importer and rerun it (because of the speed issue), it should resume from the last imported row, not start again from the first row.
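
A rough sketch of that resume behaviour, assuming source rows have increasing integer ids and that the last imported id can be looked up somewhere. `import_in_chunks` and its arguments are hypothetical, not part of the importer.

```ruby
# Hypothetical sketch: import rows in chunks, skipping everything up to last_imported_id.
# Yields each chunk and returns the id of the last row handed to the block.
def import_in_chunks(rows, last_imported_id: 0, chunk_size: 1000)
  pending = rows.select { |row| row[:id] > last_imported_id }.sort_by { |row| row[:id] }

  pending.each_slice(chunk_size) do |chunk|
    yield chunk                         # e.g. write this chunk to the database
    last_imported_id = chunk.last[:id]  # checkpoint after every chunk
  end

  last_imported_id
end

source = (1..10).map { |id| { id: id } }
imported = []

# A previous run stopped after id 4, so this run picks up at id 5.
checkpoint = import_in_chunks(source, last_imported_id: 4, chunk_size: 3) do |chunk|
  imported.concat(chunk.map { |row| row[:id] })
end
# imported is now [5, 6, 7, 8, 9, 10] and checkpoint is 10
```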


(Quang-Buu Le) #39

Hi guys, is there any need for mapping old forums (categories) onto new, configurable categories and tags? We could merge or split old categories into a new structure.

My idea is to have a .yml file that contains the new categories and the ids of their old categories, including tags somehow.
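
Something like the following could work as a starting point. The file layout and the key names (`name`, `old_ids`, `tags`) are made up for illustration; nothing like this exists in the importer as far as this thread shows.

```ruby
require "yaml"

# Hypothetical mapping file: each new category lists the old forum ids merged into it,
# plus the tags to apply to topics coming from those forums.
MAPPING_YAML = <<~YAML
  categories:
    - name: Support
      old_ids: [3, 7, 12]
      tags: [help]
    - name: Announcements
      old_ids: [1]
      tags: [news, official]
YAML

# Build a lookup from old forum id to its new category name and tags.
def build_category_lookup(yaml_text)
  YAML.safe_load(yaml_text).fetch("categories").each_with_object({}) do |category, lookup|
    category.fetch("old_ids").each do |old_id|
      lookup[old_id] = { name: category.fetch("name"), tags: category.fetch("tags", []) }
    end
  end
end

lookup = build_category_lookup(MAPPING_YAML)
lookup[7]  # { name: "Support", tags: ["help"] }
```

This only covers merging several old forums into one category. Splitting one old forum across several new categories would need an extra rule (for example, per-topic conditions) that this sketch does not model.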


#40

New issue:

discourse@ip-10-0-1-178-app:/var/www/discourse$ IMPORT=1 RAILS_ENV=production ruby script/bulk_import/vbulletin.rb
Loading application...
Starting...
Preloading I18n...
Fixing highest post numbers...
Loading imported group ids...
Loading imported user ids...
Loading imported category ids...
Loading imported topic ids...
Loading imported post ids...
Loading groups indexes...
Loading users indexes...
Loading categories indexes...
Loading topics indexes...
Loading posts indexes...
Importing groups...
Importing users...
1270000 -    119/sec
        /var/www/discourse/script/bulk_import/base.rb:521:in `blank?': invalid byte sequence in UTF-8 (ArgumentError)
        from /var/www/discourse/script/bulk_import/base.rb:521:in `fix_name'
        from /var/www/discourse/script/bulk_import/base.rb:234:in `process_user'
        from /var/www/discourse/script/bulk_import/base.rb:486:in `block (2 levels) in create_records'
        from /usr/local/lib/ruby/gems/2.4.0/gems/rack-mini-profiler-0.10.5/lib/patches/db/mysql2.rb:6:in `each'
        from /usr/local/lib/ruby/gems/2.4.0/gems/rack-mini-profiler-0.10.5/lib/patches/db/mysql2.rb:6:in `each'
        from /var/www/discourse/script/bulk_import/base.rb:483:in `block in create_records'
        from /usr/local/lib/ruby/gems/2.4.0/gems/pg-0.20.0/lib/pg/connection.rb:160:in `copy_data'
        from /var/www/discourse/script/bulk_import/base.rb:482:in `create_records'
        from /var/www/discourse/script/bulk_import/base.rb:191:in `create_users'
        from script/bulk_import/vbulletin.rb:131:in `import_users'
        from script/bulk_import/vbulletin.rb:81:in `execute'
        from /var/www/discourse/script/bulk_import/base.rb:33:in `run'
        from script/bulk_import/vbulletin.rb:494:in `<main>'

My forum encoding is “UTF8mb4