Importing from vBulletin 4

Thanks for the valuable info! :+1:t6:

I have another question, about the SQL query that select the vBulletin users.

When I imported my old phpbb forum to Discourse 3 years ago, where were about 20000 users .
Obviously, most of the users were unusued accounts. During these 3 years, Discourse has made his own clean up of inactive users, and we have now a more honest number of 3000 members.

When I imported my 180000 users from vBulletin and asked Discourse to aggressively do its cleanup job, I was left with 27000 users, which seems fair.

On my vBulletin, since:

  1. All messages posted by users on other users’s profiles are imported to Discourse to uncategorized, titleless topics that add nothing but useless noise, and
  2. The vast majority of users that posted only on other users’ profiles seems to be spammers,
    I’d like to do the cleanup during the import, and not after.

I don’t understand all of the vBulletin database, which is a bit confusing with its nodes thing, but I’d like to import only users that have posted at least 1 topic or reply.
It seems to me that topics and replies fill the lastpostid field in the vBulletin user table, but public profile posts don’t.

Si, I intend to edit

  SELECT u.userid, u.username, u.homepage, u.usertitle, u.usergroupid, u.joindate, u.email
    CASE WHEN u.scheme='blowfish:10' THEN token
         WHEN u.scheme='legacy' THEN REPLACE(token, ' ', ':')
    END AS password,
    IF(ug.title = 'Administrators', 1, 0) AS admin
    FROM #{DB_PREFIX}user u
    LEFT JOIN #{DB_PREFIX}usergroup ug ON ug.usergroupid = u.usergroupid
ORDER BY userid
   LIMIT #{BATCH_SIZE}
  OFFSET #{offset}

to:

  SELECT u.userid, u.username, u.homepage, u.usertitle, u.usergroupid, u.joindate, u.email, u.lastpost
    CASE WHEN u.scheme='blowfish:10' THEN token
         WHEN u.scheme='legacy' THEN REPLACE(token, ' ', ':')
    END AS password,
    IF(ug.title = 'Administrators', 1, 0) AS admin
    FROM #{DB_PREFIX}user u
    LEFT JOIN #{DB_PREFIX}usergroup ug ON ug.usergroupid = u.usergroupid
    WHERE u.lastpost > 0
ORDER BY userid
   LIMIT #{BATCH_SIZE}
  OFFSET #{offset}

I just add a WHERE u.lastpost > 0.

A count with this query gives me 25000 users, compared to the 27000 active users on my previous Import, after a Discourse cleanup but still having these titleless topics (former profile public messages). That would mean that about 2000 users would have posted only on other users’ profiles which is not an absurd number.

Do you think my reasoning is right and that adding WHERE u.lastpost > 0 will nicely clean up my user base without having any harmful effect?

It depends where your database has been. If it was migrated from phpBB to vBulletin or if it has gone through many vBulletin updates then this reasoning could be wrong.

The best you can do is to verify your reasoning by running more queries, for instance by listing all posts made by the users without a lastpost.

Also, if you have any plugins like “likes” or “voting” you might be removing users that you should not be removing.

My strategy is to be very conservative in removing or leaving out things. Storage is inexpensive.

4 Likes

Thank you, I’ll do as you say, I prefer to be careful. :slight_smile:
I’ll let Discourse do its own cleanup over time.

I’ve been adding custom posts cleanup regex in your migration script for days because the forum is very old and was migrated from phpbb before vbulletin, so many things had to be taken care of, but I think and I hope I’m close to finalizing things.

I don’t really know vbulletin, but the forum I’m working on was using vBulletin 5.6 and some external images I had posted in it were using this syntax in the vbulletin database:
[IMG2=JSON]{"data-align":"none","data-size":"full","src":"https:\/\/forum.monocycle.info\/uploads\/default\/original\/2X\/3\/396192845ba93e7df2a6109a2608072efa21ee32.jpeg"}[/IMG2]
I’m not sure is this is something that was “forgotten” in your script, or if the forum admin used some plugin generating these img2 tags.

Anyway, I fixed these with this code:

    raw = raw.gsub(/\[img2=json\].+?(http.+?).}\[\/img2\]/i) {"\n#{$1}\n"}

I have a question though: will discourse rebake the imported posts automatically over time? If it does, will it start with the most recent posts?

2 Likes

Hi again,
My forum has about 1000 tags, but we probably won’t use them on Discourse. Plus, these tags are probably a real mess. Can I just comment this line in the importer:

    import_tags

Or there may be some collateral damage?

1 Like

You can safely comment it out.

2 Likes

Is it safe not waiting for sidekiq to finish his jobs after importing something?

I imported my users and this is sidekiq’s current state.

What happens to these tasks if I create a backup and I restore it on a production forum?

1 Like

No, although a full rebake will recreate most of these jobs, I would absolutely recommend you to wait.

They will keep on running… on the import instance.
They will not be included in the backup or transferred to the production forum.

3 Likes

Thanks ! :+1:

Because of some error messages during backup restores (that don’t prevent the restore from finishing or the forum from working), I was wondering if it could be because I didn’t wait for sidekiq to finish his job.

I started a new import from vbulletin: I only imported groups and 30000 users on my dev discourse, waited a few dozen of minutes, then created a backup, restored the backup on a docker based installation. The restore worked, but the logs show these errors

ActiveRecord::StatementInvalid (PG::UndefinedTable: ERROR: relation "users" does not exist LINE 1: SELECT "users".* FROM "users" WHERE "users"."id" = 1 LIMIT 1 ^ ) lib/a
10:03 pm
7
ActiveRecord::StatementInvalid (PG::UndefinedTable: ERROR: relation "user_auth_tokens" does not exist LINE 1: SELECT "user_auth_tokens".* FROM "user_auth_tokens" WHERE ((...

They are inconsistent, and the relations errors differ from a backup to another.
I can’t figure out where do these come from. :confused:

1 Like

TL;DR I don’t think those errors are related to the import in any way.

I’ve seen that before, I think it is a race condition that is happening because the database is being accessed while the backup is restoring.

This could be because of regular traffic coming in at your server, or a Sidekiq process that hasn’t been paused. In both cases it is harmless. If I were you I would completely ignore all PG errors that occur before the restore has been finished completely.

2 Likes

That’s reassuring!

Thing is, not only the messages themselves are a bit scary to me, they are also scary because

  1. they happen for every (or almost? :thinking:) backup I’ve created since I started my import, whether I restore the backup in a local dev forum or a docker online install
  2. they prevent the restore log (in Discourse backup interface) to continue being written in the Discourse interface while it’s doing the restoration : it stays stuck on “unzipping archive” (or “Creating missing functions in the discourse_functions schema…” if it’s a backup without uploads).
    It looks like something has crashed, but if I wait, after a while I’m always properly and automatically logged out and when, I log in again, the import seems to have worked well besides these error messages.

Since the forum is working (besides the category edition that leads to a 502 error I described in another thread), I’m just afraid that the forum would work… Until it wouldn’t for some reason in a few weeks/months/years because of something I didn’t figure out first, which I really don’t want to happen, especially since I’ve been working on this import everyday for 1 month and counting. :sweat_smile:

Anyway, thanks for your help, it’s much appreciated. I’m putting a lot of energy in this unpaid import work for a large community, and having people replying to my questions is always relieving.

2 Likes