Migrating vBulletin 5 database - Import script errors

I’m going to start by apologizing to anyone who might feel attacked by this post. To be honest, I’ve been wrangling these issues since Monday, and at this point I’m tired of debugging and hotfixing Discourse code.

After the nth try (I stopped counting after the 7th), I think I’m going to give up, because it seems migration is not something Discourse has invested much time in supporting.

I believe the biggest problem is that the charset used in this enormous database is utf8mb4, which is not supported by the script(?).

Using utf8 (the default) simply generates lots of errors. They are reported, but it’s not clear what is happening because the script carries on anyway. Is the entry in the DB being skipped? Is it copied over with some unsupported characters (the classic squares)?

On top of that, the three latest runs (using the bulk importers), all following the exact same set of instructions, produced different results. This last run reached the topic import and immediately started reporting errors, but kept going (???):

Loading application...
Starting...
Preloading I18n...
Fixing highest post numbers...
Loading imported group ids...
Loading imported user ids...
Loading imported category ids...
Loading imported topic ids...
Loading imported post ids...
Loading groups indexes...
Loading users indexes...
Loading categories indexes...
Loading topics indexes...
Loading posts indexes...
Loading post actions indexes...
Importing categories...
Importing parent categories...
      5 -   1104/sec
Importing children categories...
    500 -   1539/sec
ERROR:  duplicate key value violates unique constraint "unique_index_categories_on_name"
DETAIL:  Key (COALESCE(parent_category_id, '-1'::integer), name)=(-1, Armata Brancaleone) already exists.
CONTEXT:  COPY categories, line 69
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/pg-1.4.5/lib/pg/connection.rb:204:in `get_last_result'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/pg-1.4.5/lib/pg/connection.rb:204:in `copy_data'
/var/www/discourse/script/bulk_import/base.rb:720:in `create_records'
/var/www/discourse/script/bulk_import/base.rb:361:in `create_categories'
script/bulk_import/vbulletin5.rb:291:in `import_categories'
script/bulk_import/vbulletin5.rb:69:in `execute'
/var/www/discourse/script/bulk_import/base.rb:98:in `run'
script/bulk_import/vbulletin5.rb:779:in `<main>'
Importing topics...
    600 -   4073/sec
ERROR: undefined method `[]' for nil:NilClass
/var/www/discourse/script/bulk_import/base.rb:513:in `process_topic'
/var/www/discourse/script/bulk_import/base.rb:724:in `block (2 levels) in create_records'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/mysql2/alias_method.rb:8:in `each'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/mysql2/alias_method.rb:8:in `each'
/var/www/discourse/script/bulk_import/base.rb:721:in `block in create_records'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/pg-1.4.5/lib/pg/connection.rb:196:in `copy_data'
/var/www/discourse/script/bulk_import/base.rb:720:in `create_records'
/var/www/discourse/script/bulk_import/base.rb:364:in `create_topics'
script/bulk_import/vbulletin5.rb:321:in `import_topics'
script/bulk_import/vbulletin5.rb:70:in `execute'
/var/www/discourse/script/bulk_import/base.rb:98:in `run'
script/bulk_import/vbulletin5.rb:779:in `<main>'
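
For anyone hitting the same duplicate key error on categories: the constraint in the message is on (COALESCE(parent_category_id, -1), name), so two source forums with the same name under the same parent would collide. Below is a minimal sketch of renaming such duplicates before they are handed to the importer. `dedupe_category_names` and the row shape are hypothetical and not part of the Discourse scripts, and the sketch only looks at the rows passed in, not at categories that already exist in the Discourse database.

```ruby
require "set"

# Hypothetical helper, NOT part of the Discourse bulk import scripts.
# Makes names unique per parent, mirroring the key shown in the error:
# (COALESCE(parent_category_id, -1), name).
def dedupe_category_names(rows)
  taken = Set.new
  rows.map do |row|
    parent = row[:parent_category_id] || -1
    name = row[:name]
    candidate = name
    suffix = 1
    # Set#add? returns nil when the key is already present,
    # so keep trying "Name (2)", "Name (3)", ... until one is free.
    until taken.add?([parent, candidate])
      suffix += 1
      candidate = "#{name} (#{suffix})"
    end
    row.merge(name: candidate)
  end
end

# Assumed row shape, for illustration only.
rows = [
  { id: 1, parent_category_id: nil, name: "Armata Brancaleone" },
  { id: 2, parent_category_id: nil, name: "Armata Brancaleone" },
  { id: 3, parent_category_id: 1, name: "Armata Brancaleone" },
]
dedupe_category_names(rows).each { |r| puts "#{r[:id]}: #{r[:name]}" }
```

The third row keeps its name because it sits under a different parent, which is what the constraint allows.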

Until finally crashing on this one:

script/bulk_import/vbulletin5.rb:779:in `<main>'
 572329 -    531/sec
Importing replies...
client_loop: send disconnect: Connection reset

But not before spamming these two errors almost constantly:

ERROR: undefined method `gsub!' for nil:NilClass
script/bulk_import/vbulletin5.rb:727:in `preprocess_raw'
script/bulk_import/vbulletin5.rb:369:in `block in import_topic_first_posts'
/var/www/discourse/script/bulk_import/base.rb:723:in `block (2 levels) in create_records'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/mysql2/alias_method.rb:8:in `each'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/mysql2/alias_method.rb:8:in `each'
/var/www/discourse/script/bulk_import/base.rb:721:in `block in create_records'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/pg-1.4.5/lib/pg/connection.rb:196:in `copy_data'
/var/www/discourse/script/bulk_import/base.rb:720:in `create_records'
/var/www/discourse/script/bulk_import/base.rb:367:in `create_posts'
script/bulk_import/vbulletin5.rb:361:in `import_topic_first_posts'
script/bulk_import/vbulletin5.rb:71:in `execute'
/var/www/discourse/script/bulk_import/base.rb:98:in `run'
script/bulk_import/vbulletin5.rb:779:in `<main>'

and

ERROR: invalid byte sequence in UTF-8
script/bulk_import/vbulletin5.rb:727:in `gsub!'
script/bulk_import/vbulletin5.rb:727:in `preprocess_raw'
script/bulk_import/vbulletin5.rb:369:in `block in import_topic_first_posts'
/var/www/discourse/script/bulk_import/base.rb:723:in `block (2 levels) in create_records'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/mysql2/alias_method.rb:8:in `each'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/mysql2/alias_method.rb:8:in `each'
/var/www/discourse/script/bulk_import/base.rb:721:in `block in create_records'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/pg-1.4.5/lib/pg/connection.rb:196:in `copy_data'
/var/www/discourse/script/bulk_import/base.rb:720:in `create_records'
/var/www/discourse/script/bulk_import/base.rb:367:in `create_posts'
script/bulk_import/vbulletin5.rb:361:in `import_topic_first_posts'
script/bulk_import/vbulletin5.rb:71:in `execute'
/var/www/discourse/script/bulk_import/base.rb:98:in `run'
script/bulk_import/vbulletin5.rb:779:in `<main>'
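
Both errors come from the same `gsub!` call in `preprocess_raw`: one when the raw text is nil, the other when it contains bytes that are not valid UTF-8. A minimal sketch of a guard that could run on the raw text before that call is shown below. `sanitize_raw` is a hypothetical helper, not something in vbulletin5.rb, and replacing invalid bytes with U+FFFD is my assumption: it only stops the exception and hides a charset mismatch rather than fixing it.

```ruby
# Hypothetical helper, NOT part of script/bulk_import/vbulletin5.rb.
# Returns a mutable, valid UTF-8 string that gsub! can be called on.
def sanitize_raw(raw)
  # Avoids "undefined method `gsub!' for nil:NilClass".
  return +"" if raw.nil?

  # Work on a copy so the row returned by the database driver is untouched.
  text = raw.dup.force_encoding(Encoding::UTF_8)

  # Avoids "invalid byte sequence in UTF-8": invalid bytes are replaced
  # with U+FFFD (an assumption; the original characters are lost).
  text.valid_encoding? ? text : text.scrub("\uFFFD")
end

puts sanitize_raw(nil).inspect

# "\xE8" is "è" in Latin-1 and is not a valid byte sequence in UTF-8.
broken = "caff\xE8".b
fixed = sanitize_raw(broken)
fixed.gsub!(/f+/, "f") # no longer raises
puts fixed.valid_encoding?
```

If a lot of posts end up with replacement characters, that would suggest the connection encoding does not match what is actually stored in the tables, which is worth checking before rerunning the import.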

Please note that I’ve gone step by step: commenting out everything except the function to run, then running rake import:ensure_consistency, then commenting out the functions that just ran before continuing, and so on. If I just let the whole script rerun steps that already ran, it simply crashes on duplicated IDs.

Before the usual “you can’t complain about free software” argument comes out, I want to clarify that I contribute to other open source projects and make software for free as well. To me it’s paramount that if I release something, it works and is well documented (even just so I can avoid the thousands of messages rightfully asking ‘how does this work’), or that I’m ready to fix whatever bug comes up.

While Discourse seems to have a great out-of-the-box experience, it should be clear that it’s 2022 and communities existed long before this product. “Adoption” needs strong migration support, and that doesn’t seem to be the current state of Discourse.

I recognize that a 20GB database is an edge case, but we are not having problems with the size here. The problem is rather the charset, or who-knows-what, as there isn’t even a consistent error. And most of all: there is no documentation besides hunting for threads and posts left by those who have gone through the same ordeal in the past, hoping a workaround was found and that the source code hasn’t changed much since then.

At this point I would strongly recommend that anyone coming from vBulletin hold off on any migration until what seems to be an overhaul of the migration scripts (already underway, it seems?) is completed.