Importers for large forums


(Régis Hanol) #1

Discourse already has around 40 importers covering a wide range of community software.
These importers work very well, but they tend to be slow for very large forums.
That’s why we’ve built the bulk importers.

What is a bulk importer?

Our standard importers go through the same code paths as the application. This has the advantage of ensuring the imported data is consistent, but it tends to be slow since records are imported one by one.

In order to go faster, we need to import in bulk.
In order to import in bulk, we need to bypass Rails and use SQL.

This solution has two drawbacks

  1. We lose pretty much all the validations (since they are done in Rails), but we can import 25 million posts in a couple of hours instead of a week
  2. We need to keep it up to date whenever we change the structure of the database

There’s not much we can do about #1 other than being careful to respect the validations in the importers.
For #2, we decided to split the code into two parts

  • An importer script that imports the minimum viable content
  • A rake task, launched post-import, that populates all the other required columns and tables

The importer will be responsible for importing the most important data that can’t be computed.
The rake task will be responsible for computing all the missing (but required) data and stats.
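To make the record-by-record vs. bulk difference concrete, here is a minimal sketch of the bulk idea. The table and column names are illustrative, not the real Discourse schema, and the actual scripts stream rows into PostgreSQL rather than building one big statement string:

```ruby
# Sketch: instead of N ActiveRecord saves (one round-trip and one
# validation pass per record), build a single multi-row INSERT.
# Table and column names here are illustrative, not the Discourse schema.
def bulk_insert_sql(table, columns, rows)
  values = rows.map do |row|
    quoted = row.map { |v| v.is_a?(String) ? "'#{v.gsub("'", "''")}'" : v.to_s }
    "(#{quoted.join(', ')})"
  end
  "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{values.join(', ')}"
end

puts bulk_insert_sql("posts", %w[user_id raw], [[1, "First!"], [2, "It's fast"]])
# INSERT INTO posts (user_id, raw) VALUES (1, 'First!'), (2, 'It''s fast')
```

One such statement inserts thousands of rows in a single round-trip, which is where the speedup comes from — at the cost of skipping the Rails validations mentioned above.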

A bulk importer will only import

  • groups (name, description)
  • users (email, username, name, title, admin/moderator, status, date of birth)
  • user passwords & salts (so they can re-use the same password)
  • user profiles (location, website, description)
  • categories (name, description)
  • topics (title, user, category, status, type)
  • posts (user, topic, raw, reply to post number, type, reads)
  • post_actions (bookmarks, likes, flags)
  • tags (name)

A bulk importer will not import

  • post revisions
  • group permissions
  • category permissions
  • avatars (1)
  • attachments (2)

(1) The script stores the avatar URLs in a custom field, which can be used later to download the avatars.
(2) Downloading & manipulating files is easily the slowest part of the import, but we might add support for bulk importing attachments.

When to use a bulk importer?

If you are planning to migrate a forum with more than 5 million posts to Discourse, then it is recommended to try our bulk importers.

We currently only support bulk importing from vBulletin but are planning to support phpBB and XenForo as well.

How to bulk import?

Setup

  • You need a working development environment of Discourse.
  • For best performance, the database of the forum you are importing should run on the same machine.

Import

  1. Fire up your terminal and go to the discourse directory

  2. Install the gem used by the importer

     IMPORT=1 bundle install
    
  3. Run the importer

     ruby script/bulk_import/vbulletin.rb
    

    You can change the locale by using the LOCALE environment variable

     LOCALE=fr ruby script/bulk_import/vbulletin.rb
    

    You can also change the connection settings of the imported database

     DB_HOST=localhost DB_USERNAME=user DB_PASSWORD=1234 DB_NAME=myforum ruby script/bulk_import/vbulletin.rb
    

Post-import

  1. Once the import is done, you need to run a rake task to generate all computed data and stats

     rake import:ensure_consistency
    
  2. Create a backup

     ./script/discourse backup
    
  3. Upload the backup to your production instance, enable restoring from a backup and restore your imported data

  4. :tada:


(Jay Pfaffman) #2

This is very cool!

But that’s not important now.

When did that become official? Last I knew, that wasn’t the case.

And especially since password rules are :poop:, I’m surprised that importing a database full of likely crappy passwords is supported now.


(Julian Muñoz) #3

Hey @zogstrip I recently began doing some tests to migrate my 6M+ posts forum over to Discourse but I was facing a lot of trouble with the importer taking so long. I’m so happy that the bulk importer is now a reality.

May I ask who is working on the XenForo importer? I would really love to help you guys out since I was going to do it anyway with the current XenForo importer. If a PR is welcome, I might be up to the task!

Cheers!


(Régis Hanol) #4

@pfaffman passwords & salts are imported in custom fields so that you can use the discourse-migratepassword (or similar) plugin to ease the migration :wink:

No one is working on it. PR is more than welcome :+1: Feel free to ask me questions if you need to.
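For reference, vBulletin 3/4 stores `md5(md5(password) + salt)`. A password-migration plugin can check a login attempt against the imported hash and salt roughly like this (the method name is mine, not the plugin’s API):

```ruby
require "digest/md5"

# vBulletin 3/4 stores md5(md5(password) + salt) in its user table.
# Given the imported salt and stored hash (e.g. from custom fields),
# a migration plugin can verify a login attempt like this.
# Method name is illustrative, not part of any plugin's API.
def vbulletin_password_match?(password, salt, stored_hash)
  Digest::MD5.hexdigest(Digest::MD5.hexdigest(password) + salt) == stored_hash
end
```

On a successful match, the plugin can then set a proper Discourse password for the user, so the legacy hash is only needed once.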


(Felix Freiberger) #5

This looks interesting! Is this task safe to run on an existing site?

After some fiddling in the rails console, I once ended up with an install where topic counts were wrong (e.g. an empty category claimed to have 10 topics in it) and manually wrote code to fix this (see # update all topic counts in the post linked above). It sounds like this rake task would probably solve issues like this :slight_smile:
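As a rough idea of what such a consistency pass does, recomputing per-category topic counts could look like the SQL below. This is only a sketch of the concept; the actual `import:ensure_consistency` task covers many more columns and tables:

```ruby
# Sketch only: recompute categories.topic_count from the topics table.
# The real import:ensure_consistency rake task does this and much more.
FIX_TOPIC_COUNTS_SQL = <<~SQL
  UPDATE categories c
     SET topic_count = (
       SELECT COUNT(*)
         FROM topics t
        WHERE t.category_id = c.id
          AND t.deleted_at IS NULL
     )
SQL
```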


(Régis Hanol) #6

It should be safe, but I haven’t tested it thoroughly. I highly recommend taking a backup first :wink:


#7

Hello,
I have a vBulletin forum with more than 80M posts and a database of more than 100GB, and I need to migrate from vBulletin to Discourse using this fantastic tool (the bulk importer).

I followed your instructions step by step with no luck; every time I run the script I get many types of errors.


First of all, the script never works when run as the root user. When I run the script as root, I get this error:

Loading application...
URGENT: FATAL:  Peer authentication failed for user "discourse"
 Failed to initialize site default
/usr/local/lib/ruby/gems/2.4.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/postgresql_adapter.rb:651:in `initialize': FATAL:  Peer authentication failed for user "discourse" (PG::ConnectionBad)

OK, let’s switch to the discourse user with su - discourse and run the script:

discourse@ip-10-0-1-178-app:/var/www/discourse$ ruby script/bulk_import/vbulletin.rb
Loading application...
/usr/local/lib/ruby/gems/2.4.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/postgresql_adapter.rb:661:in `rescue in connect': FATAL:  database "discourse_development" does not exist (ActiveRecord::NoDatabaseError)
......

Just to be clear, I searched hard for how to set up a Discourse development environment, but I can’t find anything; all the results explain how to install Discourse step by step without using Docker, which is the old way!


Please, your assistance in this matter is highly appreciated.
Thank you.


(Régis Hanol) #8

We don’t recommend running an import on a production instance.
But if you really want to, you’ll need to tell the script to use the production database

RAILS_ENV=production ruby script/bulk_import/vbulletin.rb

#9

Here is what I got when executing the script with production environment:

discourse@ip-10-0-1-178-app:/var/www/discourse$ RAILS_ENV=production ruby script/bulk_import/vbulletin.rb
Loading application...
/usr/local/lib/ruby/gems/2.4.0/gems/activesupport-4.2.8/lib/active_support/dependencies.rb:274:in `require': cannot load such file -- mysql2 (LoadError)
	from /usr/local/lib/ruby/gems/2.4.0/gems/activesupport-4.2.8/lib/active_support/dependencies.rb:274:in `block in require'
	from /usr/local/lib/ruby/gems/2.4.0/gems/activesupport-4.2.8/lib/active_support/dependencies.rb:240:in `load_dependency'
	from /usr/local/lib/ruby/gems/2.4.0/gems/activesupport-4.2.8/lib/active_support/dependencies.rb:274:in `require'
	from script/bulk_import/vbulletin.rb:2:in `<main>'

(Régis Hanol) #10

Did you do the bundle install step in production mode too?


#11

Yes

......
......
Bundle complete! 99 Gemfile dependencies, 176 gems now installed.
Gems in the group development were not installed.
Use `bundle info [gemname]` to see where a bundled gem is installed.

#12

Just to be clear, I installed the gems using IMPORT=1 RAILS_ENV=production bundle install


(Aref) #13

Hi @zogstrip @codinghorror @sam,
We are waiting for a solution to the problem, and for the final result of the transfer from @mtawil.

Thanks


(Régis Hanol) #14

Like I said, your best bet is to run the import in a dev environment outside of Docker.


#15

So you did not test the script inside Docker? Have you tried running it there?
Do you have any documentation on how to install Discourse without using Docker?

Thank you.


(Jay Pfaffman) #16

Search for “Development install” in #howto


(Mr. Revrag) #17

Is anyone actively working on the XenForo bulk importer? I would be very interested in helping test it when it’s ready.

I tried the regular XenForo importer on my site with about 1.5 million posts and it doesn’t appear to be working very well. First, all my posts are being thrown into the “Uncategorized” category. I was expecting each “Forum” in XenForo to become a category in Discourse, but that’s not what happened. And it’s running extremely slowly, importing about 100 posts a minute, which by my calculations will take well over a week to import everything.


#18

OK, I initialized the development environment and tried to run the importer as the discourse user:

$ IMPORT=1 ruby script/bulk_import/vbulletin.rb
Loading application...
Starting...
Preloading I18n...
Fixing highest post numbers...
Loading imported group ids...
Loading imported user ids...
Loading imported category ids...
Loading imported topic ids...
Loading imported post ids...
Loading groups indexes...
Loading users indexes...
/usr/local/lib/ruby/gems/2.4.0/gems/rack-mini-profiler-0.10.5/lib/patches/db/pg.rb:90:in `async_exec': PG::UndefinedTable: ERROR:  missing FROM-clause entry for table "user_emails" (ActiveRecord::StatementInvalid)
LINE 1: SELECT user_emails.email FROM "users"
               ^
: SELECT user_emails.email FROM "users"
	from /usr/local/lib/ruby/gems/2.4.0/gems/rack-mini-profiler-0.10.5/lib/patches/db/pg.rb:90:in `async_exec'
	from /usr/local/lib/ruby/gems/2.4.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/postgresql_adapter.rb:592:in `block in exec_no_cache'
	from /usr/local/lib/ruby/gems/2.4.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/abstract_adapter.rb:484:in `block in log'
	from /usr/local/lib/ruby/gems/2.4.0/gems/activesupport-4.2.8/lib/active_support/notifications/instrumenter.rb:20:in `instrument'
	from /usr/local/lib/ruby/gems/2.4.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/abstract_adapter.rb:478:in `log'
	from /usr/local/lib/ruby/gems/2.4.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/postgresql_adapter.rb:592:in `exec_no_cache'
	from /usr/local/lib/ruby/gems/2.4.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/postgresql_adapter.rb:584:in `execute_and_clear'
	from /var/www/discourse/lib/freedom_patches/fast_pluck.rb:41:in `select_raw'
	from /var/www/discourse/lib/freedom_patches/fast_pluck.rb:67:in `pluck'
	from /var/www/discourse/script/bulk_import/base.rb:96:in `load_indexes'
	from /var/www/discourse/script/bulk_import/base.rb:32:in `run'
	from script/bulk_import/vbulletin.rb:377:in `<main>'

What is wrong?


#19

Dear Régis,
I got that error because of this line:

@emails = User.unscoped.pluck(:"user_emails.email").to_set

Is this line correct, or is there something that should be fixed?


(Régis Hanol) #20

That’s indeed not working anymore. I think it’s missing a join, like this:

@emails = User.unscoped.joins(:user_emails).pluck(:"user_emails.email").to_set