More generic importer

(Michael - #1

Over the last year, we have been using our own importer at DiscourseHosting. We think we have some conceptual advantages over the ‘official’ importers, but also a lot of disadvantages. We have been contemplating cleaning up our code and releasing it, but I think it would be nicer to take some of our concepts and re-use them in the official importers.

I am opening this topic to discuss the advantages and disadvantages, and to see if you would be interested if we opened up a PR for this.

First of all, a comparison:

  • The official importers are more cleanly coded

  • The official importers are a bit faster

  • The official importers have more features - like nested categories

  • The official importers are well, the official importers

  • The offical importers are the official importers, as in: there are multiple. Some things are implemented in one importer but not in another. We have a single importer using YML config files for the different forum types like bbPress, Phorum, Vanilla, VBulletin. The importer gets the YML config file as a command line argument.

  • This makes it faster and easier to create a new importer in our approach. Almost all code is reused, we only need to make new queries mapping the source database to a generic import format

  • The official importers have hardcoded MySQL credentials, source DB and table prefixes. Those are read from the YML file in our importer, making it easier to change stuff and keep the old code as well

  • When we develop a new feature, it applies to all existing import scripts. You are using derived classes but there is still quite a lot of logic in the derived class.

  • Our import script has support for password migration using our plugin GitHub - discoursehosting/discourse-migratepassword: Support migrated password hashes so people can login using their original password

  • Our import script can write the Discourse ID to the orginal database, making redirection scripts more easy to implement on the original server (you do it the other way around but that isn’t always the best solution for our customers, for instance when the original forum is in and the new forum is at ).

  • This also makes restarting the import much faster.

An example of such a YML config script can be found below.

I would like to know if you are interested in us creating a scripts/import_scripts/generic.rb script that uses our concepts but that is using your base class like all the other importers, or maybe you explicitly chose for your current approach and you are not interested in this.

Example vanilla.yml

sql_server: localhost
sql_user: root
sql_password: password
sql_database: XXXXX

discourse_admin: system

test_mode: false

max_errors: 1000
prepare_users_query: |
  ALTER TABLE gdn_user
prepare_posts_query: |
  ALTER TABLE gdn_comment  
  ADD COLUMN discourse_id INT NOT NULL DEFAULT '0'; 

prepare_topics_query: |
  ALTER TABLE gdn_discussion  
  ADD COLUMN discourse_id INT NOT NULL DEFAULT '0'; 

get_user_query: |
    u.UserID AS user_id,
    u.Name AS fullname,
    u.Name AS username,
    u.Password AS crypted_password,
    u.Email AS email,
    IF(u.DateLastActive = '', DateInserted, DateLastActive) AS lastvisit,
    u.Admin AS is_admin,
    1 AS is_active,
  FROM gdn_user u
  WHERE u.Name != 'System'
    AND u.discourse_id != -1

get_post_query: |
    d.DiscussionID * 1000000 AS post_id,
    d.DiscussionID AS topic_id,
    d.Name AS topic_title,
    u.Name AS username,
    u.UserID AS user_id,
    u.discourse_id AS discourse_user_id,
    cat.Name AS category_name,
    d.DateInserted AS post_time,
    IFNULL(d.DateUpdated, d.DateInserted) AS post_edit_time,
    replace(replace(d.body,'\t', ''), '<br />', '\n') AS post_text,
  FROM gdn_discussion d
  LEFT JOIN gdn_category cat ON cat.CategoryID = d.CategoryID
  LEFT JOIN gdn_user u ON u.UserID = d.InsertUserID
  WHERE d.discourse_id = 0 AND d.InsertUserID > 0
  GROUP BY d.DiscussionID
    c.CommentID AS post_id,
    d.DiscussionID AS topic_id,
    d.Name AS topic_title,
    u.Name AS username,
    u.UserID AS user_id,
    u.discourse_id AS discourse_user_id,
    cat.Name AS category_name,
    c.DateInserted AS post_time,
    IFNULL(c.DateUpdated,c.DateInserted) AS post_edit_time,
    replace(replace(c.body,'\t', ''), '<br />', '\n') AS post_text,
    FROM gdn_comment c
  LEFT JOIN gdn_discussion d ON d.DiscussionID = c.DiscussionID
  LEFT JOIN gdn_category cat ON cat.CategoryID = d.CategoryID
  LEFT JOIN gdn_user u ON u.UserID = c.InsertUserID
  WHERE c.discourse_id = 0 AND c.InsertUserID > 0
  GROUP BY c.CommentID
  ORDER BY post_time

get_likes_query: |
unique_topic_query: |
  SELECT discourse_id
  FROM gdn_discussion
  WHERE DiscussionID = %d

process_user_query: |
  UPDATE gdn_user
  SET discourse_id = %d
  WHERE UserID = %d
process_topic_query: |
  UPDATE gdn_discussion
  SET discourse_id = %d
  WHERE DiscussionID = %d

process_post_query: |
  UPDATE gdn_comment
  SET discourse_id = %d
  WHERE CommentID = %d

reset_topics_query: |
  UPDATE gdn_discussion
  SET discourse_id = 0
reset_posts_query: |
  UPDATE gdn_comment
  SET discourse_id = 0
reset_users_query: |
  UPDATE gdn_user
  SET discourse_id = 0

(Marcin Cieślak) #2

One feature I usually miss from the importers is URL mapping - hard to do but very useful. It is very hard to find some old, useful discussion after migration…

(Michael - #3

We usually create redirection scripts in PHP or nginx config that do this. For each topic and post we store the new Discourse ID and use that to redirect to.

(Michael - #4

So is anybody actually interested in us doing this? I had hoped for some feedback about our approach…

(Joe Seyfried) #5

For me, the import function is pretty essential. However, I have also reached a point where I can live with the result - plus, this (kind of tedious) process only has to be performed once.

So yes, I would like to the the importer improved. But I guess the resonance around here will be moderate - people considering to switch to Discourse will try the (official) importers, and either run away instantly, or live with the results. Don’t get me wrong: the official importer is actually pretty good, especially if compared to other systems. The only real major gripe I have is speed.

(Erick Guan) #6

It’s hard to know without more example and code. By putting aside the method, how does it help with the generic?
The logic in different importer may replicate, but it’s because we faced a different table schema. It requires you write more complicated SQL instead of Ruby. It prevents you from partitioned table.

Cool part! I want to know how it process different kind of data, e.g. date, datetime?

This is good. The importer code may be tweaked every time for better migration but this configuration part is stable and can be moved to Yaml file. The difficulties here is every importer may have different kind of dependencies. Some may ask MySQL, some may ask another. In some cases, we need to assign directory path. How do I know which constants need to be set?

Not sure about that…You end up writing different logic to authenticate. It’s a bad smell since core changes.

Does it handle the different parameters of old url?

(Gerhard Schlager) #7

That’s an interesting idea, but I’m not sure if this can actually work. Imports usually need a lot of processing of posts and other raw text (like replacing BBCodes, changing internal links, replacing smilies or parsing a birthdate that’s stored in some crazy format). There’s nothing generic about that - it’s quite specific to each kind of forum software. I’m not sure how you’d put that in the generic importer. You’d need a special importer for each forum to do that. What would be the benefit of such a generic importer?

I agree. Those hard-coded values are far from ideal… In fact I dislike it so much that I moved all the configuration options from the phpBB3 importer (still a work in progress) into a YML file.

Well, everything that’s put inside the base importer is available to all import scripts. That’s quite the same, isn’t it? And as I said above, I believe there’s a need for logic that’s specific to each forum.

Yeah, I’d like to have that in the official importer as well! Would be great if you could add that as optional import feature to the base importer. Then I wouldn’t have to do it myself. :wink: I was planning to add it during my work on the phpBB3 importer.

Why is it faster? Can you elaborate on that?

In my opinion the current importers (at least some of them) are quite good.
Yes, they are uncomfortable to use. Yes, the hard-coded values are horrible.
And yes, the base importer is a large beast that probably needs some refactoring. :slight_smile:

I started extracting a few classes from it (since I needed them for my work on the phpBB3 importer), but there’s still a lot of work to do. And of course I had to leave all the old methods (as delegates) in the base importer in order to not break all the other importers. I split up the phpBB3 importer into multiple classes as well since I didn’t feel comfortable putting 1400 lines of code (logic, SQL) into one file.

Overall I like the current importers. What I’d like to have in the future:

  • An easy way to import within the Docker container.
  • A settings file for each importer.
  • No need to change Ruby code unless someone wants to import something that’s not available per default.
  • Every importer should import as much data as possible (e.g. attachments, polls, messages).
  • No duplicate logic within the importers.
  • Smaller, reusable parts instead of one large base importer that is hard to maintain.

(Michael - #8

Thanks for all your feedback ! :heart:

That is correct. But this way we separate the oddities of the source database from the oddities of the destination.

Right now we need to have a UNIX_TIMESTAMP for all date/time stuff.

No, we have fallback logic that is only used when authentication using the regular way fails, and it’s only used at most once per user.

Yes, we have different sets of redirection scripts for phpBB, Phorum, VBulletin, etc.

It works, since we have made quite a lot of importers already. So the concept is proven.

It’s a bit the same indeed, but for us the main advantage was the ability to test SQL scripts directly in a database client, which allows for much faster development.

What we see is that all importers could use the superset of all this processing, i.e. processing made for one kind of forum doesn’t hurt the other.

Yeah, I see what you mean. And your approach was to split it into classes, ours was to separate SQL and Ruby code.

(Michael - #9

Oops, forgot one.

The current importers load all users/posts from the Discourse database, then start running SQL on the source database, and then call create_users/posts for all users. For each user/post it checks the array of existing users/posts and skips import if it’s already in there.
We just do SELECT columns FROM sourcedatabase.posts WHERE discourse_id = 0

So upon restart of a large import, the current ‘official’ importers will all loop through all existing users/posts for minutes and minutes, and ours just starts where it left off. If we see that a certain post went wrong, we adjust the code and then just UPDATE posts SET discourse_id = 0 WHERE id=x, restart the importer, and 5 seconds later the new post is imported using the new code. With the official importer this can take up to 20 minutes (depending on the forum size).

(Gerhard Schlager) #10

I don’t get it. That’s the same with the official importers. Copy the SQL and put it into the database client. Works the same for me…

I wouldn’t count on that. :wink:
Best case, it gets slower for every kind of forum (all those regexps are expensive). Worst case, it wreaks havoc.

Good point. I guess we could add that optimization to the existing importers.
But, the ideal solution would be to find the reasons why we even need to restart the imports. They should just work. Fast and without any user interaction.

(Michael - #11

Well, I think that’s inevitable during importer development.

(Thomas Wilson) #12

I’m very much in favour of this. We had to write a new importer for our old forum software (LDU), which was a nightmare. Every now and then the process would crash for one reason or another, forcing us to repair and restart. As you say, this is very time consuming - having resume functionality with your additional column trick is a great idea :smile:.

Integrating it with your migrate-pasword plugin is also a massive plus. How will your script treat newly imported users? Will they be ‘active’ from the word go?

(Neil Lalonde) #13

In almost all cases, I haven’t been able to use the import scripts as-is, because every import is a special snowflake with requirements that never came up before. Normally I need to edit the scripts to add code that wouldn’t be helpful to anyone else. That’s why I haven’t made an effort to remove the hard-coding of db names, etc.

This would be handy! Can you add to the base class? Why is it only a feature that your importer can have?

We should always try to put features in the base class. If there are places where we (I) haven’t done that, it could be fixed.

I’m not against a general all-purpose importer, but I fear double-maintenance and confusion.

(Michael - #14

Of course - and I just wanted to see if you’d be interested before submitting a PR :smile:

Ok, we’re going to do as follows: we have a Phorum importer script lying around and we might be doing another IPB migration in the near future. I will make a phorum.rb and try to incorporate as much of our existing concepts in there, submitting PR’s for the base class where convenient. I’ll be moving some stuff into an yml config file, and see how much logic phorum.rb will still (need to) contain. If it only contains SQL and some control logic, then we’ll see if we can make this more generic.

Migrating from mybb
How to import a Phorum database (via Vanilla Porter)
(Jeff Atwood) #15

Probably the best way to think of this is in terms of UI rather than code structure.

What we want, ideally by the end of 2015, is a web-based importing UI that lets you perform an import of, say, vBulletin 4 to Discourse, entirely through a web browser.

But maybe before we bolt on a web UI, we should make sure there is a relatively easy command line way to perform an import using a Digital Ocean droplet. This also implies fairly strict memory limits, we cannot ask people to get an 8gb ram instance just to make an import work.

(OK, maybe by the end of 2016?)