Yep, I am going to do that if I manage to get at least the user import complete. Currently it explodes when trying to work on the emails.
Just remove the first line of script/bulk_import/vbulletin5.rb:

# frozen_string_literal: true
Alright so, running only the first three functions:
def execute
  # enable as per requirement:
  #SiteSetting.automatic_backups_enabled = false
  #SiteSetting.disable_emails = "non-staff"
  #SiteSetting.authorized_extensions = '*'
  #SiteSetting.max_image_size_kb = 102400
  #SiteSetting.max_attachment_size_kb = 102400
  #SiteSetting.clean_up_uploads = false
  #SiteSetting.clean_orphan_uploads_grace_period_hours = 43200
  #SiteSetting.max_category_nesting = 3

  import_groups
  import_users
  import_group_users

  #import_user_emails
  #import_user_stats
  #import_user_profiles
  #import_user_account_id
  #import_categories
  #import_topics
  #import_topic_first_posts
  #import_replies
  #import_likes
  #import_private_topics
  #import_topic_allowed_users
  #import_private_first_posts
  #import_private_replies
  #create_oauth_records
  #create_permalinks
  #import_attachments
end
Result:
I am assuming that the message about ensuring consistency is meant for when the full import is done? Or should I run it after each “step” and then make a copy of the discourse directory from the host as a backup?
Launching it again with the next 4 functions active returns an error about already-existing IDs.
Can this be an “all or nothing” thing? Maybe it expects everything to be done in one big transaction?
Retried again. The process went through for quite some time, then suddenly this.
The frustrating part is that it seems out of the blue.
Now launching it again results in this error.
Too tired now to check what it refers to, especially because a duplicate key value shouldn’t happen at all if I simply relaunched the bulk import script, should it??
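From what I can tell from the logs, the bulk importer does try to make re-runs resumable: it reloads its imported-id mappings at startup, and the source queries are supposed to skip everything below the previous run’s high-water mark. A self-contained sketch of that idea as I understand it (connection details made up; this is not the actual base.rb code):

require "mysql2"

client = Mysql2::Client.new(host: "localhost", username: "vb_user",
                            password: "secret", database: "vbulletin")

last_imported_user_id = 42_000 # hypothetical high-water mark from the previous run

# Only select rows the previous run never reached, so nothing is re-inserted
users = client.query(<<~SQL, stream: true, cache_rows: false)
  SELECT userid, username, email
  FROM user
  WHERE userid > #{last_imported_user_id}
  ORDER BY userid
SQL

When that bookkeeping and the actual table contents drift apart, duplicate key errors like this are exactly what you’d expect to see.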
I’m going to start by apologizing to anyone who might feel attacked by this post because, to be honest, I’ve been wrangling these issues since Monday and at this point I’m tired of debugging and hotfixing Discourse code.
After the nth try (I stopped counting after the 7th) I think I’m going to give up, because it seems like migration is not something Discourse has invested much time in supporting.
I believe that the biggest problem is that the charset used in this enormous database is utf8mb4, which is not supported by the script(?).
Using utf8 (the default) simply generates lots of errors that get reported, but it’s not clear what is happening, as the script goes ahead anyway. Is the entry in the DB being skipped? Copied over with some unsupported characters (the classic squares)?
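Side note: the mysql2 client the bulk importers use can at least be told to read 4-byte UTF-8 at the connection level; whether the rest of the pipeline copes is another question. A minimal sketch, with connection details made up:

require "mysql2"

client = Mysql2::Client.new(
  host: "localhost",    # hypothetical connection details
  username: "vb_user",
  password: "secret",
  database: "vbulletin",
  encoding: "utf8mb4"   # 4-byte UTF-8; MySQL's plain "utf8" tops out at 3 bytes per character
)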
On top of that, the three latest runs (using the bulk importer), following the exact same set of instructions, had different results. This last run reached the topic import and immediately started reporting errors while going ahead anyway (???):
Loading application...
Starting...
Preloading I18n...
Fixing highest post numbers...
Loading imported group ids...
Loading imported user ids...
Loading imported category ids...
Loading imported topic ids...
Loading imported post ids...
Loading groups indexes...
Loading users indexes...
Loading categories indexes...
Loading topics indexes...
Loading posts indexes...
Loading post actions indexes...
Importing categories...
Importing parent categories...
5 - 1104/sec
Importing children categories...
500 - 1539/sec
ERROR: duplicate key value violates unique constraint "unique_index_categories_on_name"
DETAIL: Key (COALESCE(parent_category_id, '-1'::integer), name)=(-1, Armata Brancaleone) already exists.
CONTEXT: COPY categories, line 69
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/pg-1.4.5/lib/pg/connection.rb:204:in `get_last_result'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/pg-1.4.5/lib/pg/connection.rb:204:in `copy_data'
/var/www/discourse/script/bulk_import/base.rb:720:in `create_records'
/var/www/discourse/script/bulk_import/base.rb:361:in `create_categories'
script/bulk_import/vbulletin5.rb:291:in `import_categories'
script/bulk_import/vbulletin5.rb:69:in `execute'
/var/www/discourse/script/bulk_import/base.rb:98:in `run'
script/bulk_import/vbulletin5.rb:779:in `<main>'
Importing topics...
600 - 4073/sec
ERROR: undefined method `[]' for nil:NilClass
/var/www/discourse/script/bulk_import/base.rb:513:in `process_topic'
/var/www/discourse/script/bulk_import/base.rb:724:in `block (2 levels) in create_records'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/mysql2/alias_method.rb:8:in `each'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/mysql2/alias_method.rb:8:in `each'
/var/www/discourse/script/bulk_import/base.rb:721:in `block in create_records'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/pg-1.4.5/lib/pg/connection.rb:196:in `copy_data'
/var/www/discourse/script/bulk_import/base.rb:720:in `create_records'
/var/www/discourse/script/bulk_import/base.rb:364:in `create_topics'
script/bulk_import/vbulletin5.rb:321:in `import_topics'
script/bulk_import/vbulletin5.rb:70:in `execute'
/var/www/discourse/script/bulk_import/base.rb:98:in `run'
script/bulk_import/vbulletin5.rb:779:in `<main>'
Until finally crashing on this one:
script/bulk_import/vbulletin5.rb:779:in `<main>'
572329 - 531/sec
Importing replies...
client_loop: send disconnect: Connection reset
But not before constantly spamming these two errors left and right:
ERROR: undefined method `gsub!' for nil:NilClass
script/bulk_import/vbulletin5.rb:727:in `preprocess_raw'
script/bulk_import/vbulletin5.rb:369:in `block in import_topic_first_posts'
/var/www/discourse/script/bulk_import/base.rb:723:in `block (2 levels) in create_records'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/mysql2/alias_method.rb:8:in `each'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/mysql2/alias_method.rb:8:in `each'
/var/www/discourse/script/bulk_import/base.rb:721:in `block in create_records'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/pg-1.4.5/lib/pg/connection.rb:196:in `copy_data'
/var/www/discourse/script/bulk_import/base.rb:720:in `create_records'
/var/www/discourse/script/bulk_import/base.rb:367:in `create_posts'
script/bulk_import/vbulletin5.rb:361:in `import_topic_first_posts'
script/bulk_import/vbulletin5.rb:71:in `execute'
/var/www/discourse/script/bulk_import/base.rb:98:in `run'
script/bulk_import/vbulletin5.rb:779:in `<main>'
and
ERROR: invalid byte sequence in UTF-8
script/bulk_import/vbulletin5.rb:727:in `gsub!'
script/bulk_import/vbulletin5.rb:727:in `preprocess_raw'
script/bulk_import/vbulletin5.rb:369:in `block in import_topic_first_posts'
/var/www/discourse/script/bulk_import/base.rb:723:in `block (2 levels) in create_records'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/mysql2/alias_method.rb:8:in `each'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/mysql2/alias_method.rb:8:in `each'
/var/www/discourse/script/bulk_import/base.rb:721:in `block in create_records'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/pg-1.4.5/lib/pg/connection.rb:196:in `copy_data'
/var/www/discourse/script/bulk_import/base.rb:720:in `create_records'
/var/www/discourse/script/bulk_import/base.rb:367:in `create_posts'
script/bulk_import/vbulletin5.rb:361:in `import_topic_first_posts'
script/bulk_import/vbulletin5.rb:71:in `execute'
/var/www/discourse/script/bulk_import/base.rb:98:in `run'
script/bulk_import/vbulletin5.rb:779:in `<main>'
Please note that I’ve gone step by step: commenting in the functions to run, then running rake import:ensure_consistency before continuing, then commenting out the ones that just ran, and so on, because if I just let the whole script rerun previously completed steps, it simply crashes on duplicated IDs.
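The “unique_index_categories_on_name” failure above looks like a different beast from those re-run duplicates, though: it seems to mean the source forum genuinely has two subforums with the same name under the same parent. A sketch of how one might hunt those down before importing, assuming vB5 keeps forums as rows in the node table (the exact column names here are guesses):

require "mysql2"

client = Mysql2::Client.new(host: "localhost", username: "vb_user",
                            password: "secret", database: "vbulletin")

# List titles that occur more than once under the same parent node
client.query(<<~SQL).each { |row| puts row.inspect }
  SELECT parentid, title, COUNT(*) AS n
  FROM node
  GROUP BY parentid, title
  HAVING COUNT(*) > 1
SQL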
Before the usual “you can’t complain about free software” argument comes out, I want to clarify that I contribute to other open source projects and make software for free as well, but it’s paramount to me that if I release something, that something works and is well documented (even just so I can avoid the thousands of messages rightfully asking “how does this work?”), or that I’m ready to fix whatever bug comes out.
While Discourse seems to have a great out-of-the-box experience, it should be well understood that it’s 2022 and communities existed long before this product. “Adoption” requires strong migration support, and that doesn’t seem to be the current state of Discourse.
I recognize that a 20GB database is an edge case, but the problem here isn’t the size; it’s the charset or who-knows-what, as there isn’t even a consistent error. And most of all: there is no documentation besides hunting for threads and posts left by those who have gone through the same ordeal in the past, hoping a workaround was found and that the source code hasn’t changed much since then.
At this point I would strongly recommend that anyone coming from vBulletin hold off on any migration until what seems to be an overhaul of the migration scripts (underway, it seems?) is completed.
While I feel your pain (migrations is a bear of a subject), as a Migration Specialist with Discourse let me set the record straight.
We have a mature migration framework with over 60 scripts for different platforms, a separate bulk framework with 5 scripts, and a newer framework in the works which massively improves on every aspect - performance, code organization, testability, verifiability, documentation, and so on.
We have a separate Migrations team with extensive core developer support, and we contribute generic improvements back into the code with each migration we complete. We’re constantly doing migrations for customers ranging from trivial to unbelievably complex.
Our end goal is to make migrations as streamlined as possible both for hosted customers and for the community, but the amount of code that’s in scope during a migration is just too massive, and system-level software configuration, third-party software changes, and input-data variability only compound the problem.
Again, I wish all this stuff was more painless, but making it so takes untold worker hours to create and maintain and there are only so many to go around.
Don’t give up!
I appreciate and understand that the scope is immense. It’s just frustrating to keep stumbling from exception to exception, and the fact that the project is written in Ruby doesn’t help with finding help besides coming here, basically, which cannot possibly accommodate all the requests for help since, as you say, some are very niche cases that are simply impossible to help with without hands on the actual data.
I also put a large amount of the blame on the absolute clusterfuck that is the structure of vBulletin.
I just checked this morning and this is a summary of table sizes.
To give context, the “text” table is where the actual content is.
The node table holds the hierarchy and closure… let me quote here because I can’t even:
The Closure table builds the parent-child relationships between all the nodes. The majority of your database is made up of attached files, which shouldn’t be stored in the database anyway
So overall, for a forum with ~8GB of content, there is an overhead of 28GB. Great stuff, congrats vbulletin.
This is what I mean when I say that it’s frustrating.
Again, the same set of actions (following a runbook I wrote myself through all the trial-and-error), running on a fresh Discourse installation.
Result:
Where are you, import_user_account_id?
But most importantly: how did you manage not to cause an error in the previous run, where it failed on the topic import?
Commenting out that function invocation (which seems like it was important anyway) and launching again:
Those duplicate key errors… shouldn’t the script know that it has already done those IDs and move on?
Every import is different. You’d think that a script that works for one instance of your-previously-favorite-forum would Just Work, but it doesn’t. And for a huge forum, it’s really hard. It’s simply not something that is easy to support. And the bulk importers access the database directly rather than counting on Rails to be able to automagically check stuff as it goes.
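To see why that matters, here’s a self-contained toy (a temp table, not the importer’s actual schema) showing how a PostgreSQL COPY fails wholesale on a duplicate key, with no Rails validation ever getting a chance to intervene; it’s the same shape as the errors above:

require "pg"

conn = PG.connect(dbname: "discourse") # hypothetical connection details

# Only PostgreSQL's own constraints apply here; no ActiveRecord model is consulted
conn.exec("CREATE TEMP TABLE copy_demo (id int PRIMARY KEY, name text)")

conn.copy_data("COPY copy_demo (id, name) FROM STDIN") do
  conn.put_copy_data("1\talice\n")
  conn.put_copy_data("1\tbob\n") # duplicate key: the entire COPY aborts at once
end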
That’s a not-infrequent problem, and not the fault of the script. You’ll need to figure out how to get your old database moved to utf8.
There is strong migration support. There just isn’t free migration support. I’ve done on the order of 100 migrations and written several import scripts for unsupported or bespoke systems. I’d likely charge $3000-5000 to import your database. That’s not an offer, that’s just to give you an idea of how much work it is for someone who’s done it a bunch of times. I suspect that if you were to pay for a year of Business hosting, CDCK would do it for free, which may be less than I would charge to do it. (Oh, but you might not be eligible for business hosting with a database that size).
Continuing my exploration here.
- The script references a function that does not exist: import_user_account_id. You (Discourse devs) might want to fix that.
- The logic that checks topic titles somehow gets mad at some topics that, for some reason, have an empty string as a title. As much as that shouldn’t happen, the check that evaluates it should catch it and return nil, but apparently that breaks the follow-up logic written in the import (see here).
I had that issue with an import I did in recent memory. Better would be for it to return something like “topic XXX is missing a title” or to pull the first line of text from the post, but that’s hard to do in this context. I think what I’d be tempted to do is fix it by munging your database and using something else to generate titles where they are missing.
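If you go that route, step one is just finding the offenders. A sketch (the node table and its columns are assumptions about the vB5 schema here, and note that in vB5 replies legitimately have empty titles, so don’t blindly UPDATE everything):

require "mysql2"

client = Mysql2::Client.new(host: "localhost", username: "vb_user",
                            password: "secret", database: "vbulletin")

# List nodes with blank titles for review; restricting this to actual topic
# nodes (as opposed to replies) depends on the particulars of your schema
client.query(<<~SQL).each { |row| puts "node #{row['nodeid']} has no title" }
  SELECT nodeid
  FROM node
  WHERE title IS NULL OR title = ''
SQL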
Yep, I’m basically “sanitizing” the DB as I find these issues, but it’s just hard, as I have to debug the script and then guess what is causing the issue each time.
Still no clue about the missing function import_user_account_id.
Especially given that the holidays are coming very soon, it’s unlikely that anyone will be fixing that unless they are using the script themselves. (Usually when I say that Richard will come in and save the day.)
LOL, well, I guess I’m going to disappoint you today. I tried, though, but I suspect that the commit of this importer was incomplete and did not include some changes to base.rb. @justin worked on this, maybe he knows. I do suspect that this could have been a customer-specific thing that can be commented out without further consequences.
I have never used the bulk importers myself either.
Yes, import scripts can be complex and depend on database specifics, but some scripts are simply not in a working state. That goes for this one as well, and there are some more scripts with, for instance, # frozen_string_literal: true that are just not working out of the box.
Ha!
That’s (at least) part of why I find it so difficult to submit PRs for the changes that I make. By the time I’m done there’s so much case-specific stuff in there that I’m afraid whatever I submit will be broken somehow.
Yeah. I think that something went through and added frozen_string_literal to every file. Most files got fixed because they had tests, but there are no tests for the import scripts.
Hey, just to clarify, I’m not expecting anyone to fix this now (Karen style). I’m just pointing out some things that clearly have issues in the codebase itself and are probably just an “oops, I forgot to add this change to the commit!”
I’ve already accepted at this point that this migration won’t happen before January at the earliest.
Everyone should just enjoy the holidays
I’ll bring this up or open a new thread after the holidays, even if I’ll definitely have less time to dedicate to this migration.
Yeah, totally true - for imports I usually don’t submit a PR before I’ve done two imports from different clients.
Just keeping this updated.
I’ve made some changes to base.rb after discussing it with some other engineers we have in our community. Lots of errors were caused by gsub! erroring out because apparently we had some topics with '' as the title.
We added a function mimicking the normalize_text function that simply returns the imported_id of the thread, cast as a string, if there is no content:
def normalize_text_thread(text, imported_id)
  return imported_id.to_s unless text.present?
  @html_entities.decode(normalize_charset(text.presence || "").scrub)
end
Then in vbulletin5.rb we changed the lines in the create_topics call into:
create_topics(topics) do |row|
  created_at = Time.zone.at(row[5])
  title = normalize_text_thread(row[1], row[0])
That got rid of the issue. Basically, gsub! doesn’t cope well with getting nil as input.
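The failure mode in miniature (the BBCode pattern here is purely illustrative):

title = nil

begin
  title.gsub!(/\[video\]/i, "") # a preprocess_raw-style call on a missing title
rescue NoMethodError => e
  puts e.message                # => undefined method `gsub!' for nil:NilClass
end

title&.gsub!(/\[video\]/i, "") # safe navigation: returns nil instead of raising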
However, this let the script go on, but when it reached import_private_topics it hung there. There are 253,427 private topics (PMs) in our DB, which is several orders of magnitude fewer than the replies. After 9 hours I stopped the script to see what was actually going on.
Firing up the interface I noticed a couple of things.
- My account was not imported, because the admin user created during setup was using the same email, I suppose. Obvious, but maybe something that should be written down somewhere?
- Only some of the categories (vBulletin subforums) were imported.
- Only topics and their first reply were imported (not sure if really all of them), and they were all imported without being put in the correct categories, even the ones whose category had been created. Everything is imported “without category”.
- The “replies counter” shows -1, probably because the replies were actually not imported at all.
I’ll add that, overall, LOTS of issues with this bulk importer would go away if it implemented a pagination approach. I think the replies went missing because the script tried to go through them all at once, and with 7GB of data that was impossible. It baffles me, to be honest, that a bulk importer doesn’t approach the import with pagination. Even simply taking 1000 records at a time, writing them, storing the last record ID written and looping would solve any issue with big databases.
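Roughly this, as a self-contained sketch with mysql2 (the text table with nodeid/rawtext is my understanding of where vB5 stores post bodies; connection details are made up):

require "mysql2"

client = Mysql2::Client.new(host: "localhost", username: "vb_user",
                            password: "secret", database: "vbulletin")

batch_size = 1_000
last_id = 0 # persist this somewhere durable to make the run resumable

loop do
  rows = client.query(<<~SQL).to_a
    SELECT nodeid, rawtext
    FROM text
    WHERE nodeid > #{last_id}
    ORDER BY nodeid
    LIMIT #{batch_size}
  SQL
  break if rows.empty?

  # hand the batch off to the importer here (e.g. create_posts / COPY)
  last_id = rows.last["nodeid"]
end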
FWIW, I’m following this with interest, and am very much appreciating the updates. I don’t know much about migrations so far, but I am finding this very informative.