Discourse already has around 40 importers in order to cover a wide range of community software.
These importers work very well, but they tend to be slow for very large forums.
That’s why we’ve built the bulk importers.
What is a bulk importer?
Our standard importers go through the same code paths as the application. This has the advantage of ensuring the imported data is consistent, but it tends to be slow because records are imported one by one.
In order to go faster, we need to import in bulk.
In order to import in bulk, we need to bypass Rails and use SQL.
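To make "use SQL" a bit more concrete: one common way to load many rows into PostgreSQL at once is its COPY command, which accepts tab-separated text. Whether the bulk importer uses COPY is an assumption on my part, and the helper names `copy_escape` and `to_copy_line` are made up for this sketch. It only shows how rows could be serialized in plain Ruby before being handed to the database.

```ruby
# Sketch only: formats rows in PostgreSQL's COPY text format
# (tab-separated columns, \N for NULL, backslash escapes).
# `copy_escape` and `to_copy_line` are hypothetical helper names.

COPY_ESCAPES = {
  "\\" => "\\\\",
  "\t" => "\\t",
  "\n" => "\\n",
  "\r" => "\\r",
}.freeze

# Escape a single value; nil becomes the NULL marker \N.
def copy_escape(value)
  return "\\N" if value.nil?
  value.to_s.gsub(/[\\\t\n\r]/, COPY_ESCAPES)
end

# Turn one row (an array of values) into one line of COPY input.
def to_copy_line(row)
  row.map { |value| copy_escape(value) }.join("\t") + "\n"
end

rows = [
  [1, "alice", "Hello\tworld"],
  [2, "bob", nil],
]

# One string containing every row, ready to be streamed in a single statement.
payload = rows.map { |row| to_copy_line(row) }.join
puts payload
```

A payload like this would be streamed to a `COPY ... FROM STDIN` statement through a PostgreSQL client, so thousands of rows travel in one statement instead of one INSERT per record.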
This solution has 2 drawbacks
We lose pretty much all the validations (since they’re done in Rails), but we can import 25 million posts in a couple of hours instead of a week
We need to keep it up to date whenever we change the structure of the database
There’s not much we can do about #1 other than being careful to respect those validations in the importers.
For #2 we decided to split the code into 2 parts
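As a rough idea of what "being careful" could look like, here is a minimal sketch that filters rows in the importer before they are written. The rules shown (required email and username, case-insensitive unique email) are illustrative assumptions, not Discourse's actual validation rules, and `clean_users` is a hypothetical name.

```ruby
# Sketch only: the checks below are illustrative assumptions, not
# Discourse's actual validation rules. `clean_users` is a hypothetical name.

def clean_users(rows)
  seen_emails = {}
  rows.each_with_object([]) do |row, valid|
    email = row[:email].to_s.strip.downcase
    username = row[:username].to_s.strip

    next if email.empty? || username.empty? # skip rows missing required fields
    next if seen_emails.key?(email)         # skip duplicate emails

    seen_emails[email] = true
    valid << row.merge(email: email, username: username)
  end
end

source_rows = [
  { email: " A@Example.com ", username: "alice" },
  { email: "a@example.com", username: "alice2" }, # duplicate once normalized
  { email: "", username: "nobody" },              # missing email
]

cleaned = clean_users(source_rows)
puts cleaned.inspect
```

Only the first row survives here, with its email normalized. Anything Rails would normally reject has to be caught by checks like these, since the database will accept whatever it is given.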
An importer script which will import the minimum viable content
A rake task that is launched post-import in order to populate all the other required columns and tables
The importer will be responsible for importing the most important data that can’t be computed.
The rake task will be responsible for computing all the missing (but required) data and stats.
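To illustrate the split, here is a minimal sketch of the kind of derived data the post-import step could recompute from the minimal imported rows. Plain hashes stand in for database rows, and the field names and the `compute_topic_stats` helper are assumptions for illustration; the real rake task works against the database.

```ruby
# Sketch only: plain hashes stand in for database rows, and the field
# names are assumptions chosen for illustration.

# Derive per-topic statistics from nothing but the imported posts.
def compute_topic_stats(posts)
  stats = Hash.new do |hash, topic_id|
    hash[topic_id] = { posts_count: 0, highest_post_number: 0 }
  end

  posts.each do |post|
    topic = stats[post[:topic_id]]
    topic[:posts_count] += 1
    topic[:highest_post_number] =
      [topic[:highest_post_number], post[:post_number]].max
  end

  stats
end

imported_posts = [
  { topic_id: 1, post_number: 1 },
  { topic_id: 1, post_number: 2 },
  { topic_id: 2, post_number: 1 },
]

topic_stats = compute_topic_stats(imported_posts)
puts topic_stats.inspect
```

Because these values can always be recomputed from the posts, the importer does not need to know about them, which is what keeps it small when the database structure changes.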
A bulk importer will only import
groups (name, description)
users (email, username, name, title, admin/moderator, status, date of birth)
user passwords & salts (so they can re-use the same password)
user profiles (location, website, description)
categories (name, description)
topics (title, user, category, status, type)
posts (user, topic, raw, reply to post number, type, reads)
post_actions (bookmarks, likes, flags)
tags (name)
A bulk importer will not import
post revisions
group permissions
category permissions
avatars (1)
attachments (2)
(1) the script stores the avatar’s URLs in a custom field which can be used later to download the avatars
(2) downloading & manipulating files is easily the slowest part of the import, but we might add support for bulk importing attachments
When to use a bulk importer?
If you are planning to migrate a forum with more than 5 million posts to Discourse, then it is recommended to try our bulk importers.
We currently only support bulk importing from vBulletin but are planning to support phpBB and XenForo as well.
How to bulk import?
Setup
You need to have a working development environment of Discourse.
The database of the forum you are importing should be running on the same machine for best performance
Import
Fire up your terminal and go to the discourse directory
Install the gem used by the importer
IMPORT=1 bundle install
Run the importer
ruby script/bulk_import/vbulletin.rb
You can change the locale by using the LOCALE environment variable
LOCALE=fr ruby script/bulk_import/vbulletin.rb
You can also change the connection settings of the imported database
Hey @zogstrip I recently began doing some tests to migrate my 6M+ posts forum over to Discourse but I was facing a lot of trouble with the importer taking so long. I’m so happy that the bulk importer is now a reality.
May I ask who is working on the XenForo importer? I would really love to help you guys out since I was going to do it anyway with the current XenForo importer. If a PR is welcome, I might be up to the task!
This looks interesting! Is this task safe to run on an existing site?
After some fiddling in the rails console, I once ended up with an install where topic counts were wrong (e.g. an empty category claimed to have 10 topics in it) and manually wrote code to fix this (see # update all topic counts in the post linked above). It sounds like this rake task would probably solve issues like this
Hello,
I have a vBulletin forum with more than 80M posts and a database of more than 100GB, and I need to migrate it from vBulletin to Discourse using that fantastic tool (the bulk importer).
I followed your instructions step by step with no luck; every time I run the script I get various errors.
First of all, the script never works when run as the root user. When I run it as root I get this error:
Loading application...
URGENT: FATAL: Peer authentication failed for user "discourse"
Failed to initialize site default
/usr/local/lib/ruby/gems/2.4.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/postgresql_adapter.rb:651:in `initialize': FATAL: Peer authentication failed for user "discourse" (PG::ConnectionBad)
OK, let’s switch to the discourse user with su - discourse and run the script:
discourse@ip-10-0-1-178-app:/var/www/discourse$ ruby script/bulk_import/vbulletin.rb
Loading application...
/usr/local/lib/ruby/gems/2.4.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/postgresql_adapter.rb:661:in `rescue in connect': FATAL: database "discourse_development" does not exist (ActiveRecord::NoDatabaseError)
......
Just to be clear, I searched hard for how to set up a Discourse development environment, but I couldn’t find anything. All the results describe how to install Discourse step by step without Docker, which is the old way!
Your assistance in this matter would be highly appreciated.
Thank you.
We don’t recommend running an import on a production instance.
But if you really want to, you’ll need to tell the script to use the production database
Here is what I got when running the script in the production environment:
discourse@ip-10-0-1-178-app:/var/www/discourse$ RAILS_ENV=production ruby script/bulk_import/vbulletin.rb
Loading application...
/usr/local/lib/ruby/gems/2.4.0/gems/activesupport-4.2.8/lib/active_support/dependencies.rb:274:in `require': cannot load such file -- mysql2 (LoadError)
from /usr/local/lib/ruby/gems/2.4.0/gems/activesupport-4.2.8/lib/active_support/dependencies.rb:274:in `block in require'
from /usr/local/lib/ruby/gems/2.4.0/gems/activesupport-4.2.8/lib/active_support/dependencies.rb:240:in `load_dependency'
from /usr/local/lib/ruby/gems/2.4.0/gems/activesupport-4.2.8/lib/active_support/dependencies.rb:274:in `require'
from script/bulk_import/vbulletin.rb:2:in `<main>'
......
......
Bundle complete! 99 Gemfile dependencies, 176 gems now installed.
Gems in the group development were not installed.
Use `bundle info [gemname]` to see where a bundled gem is installed.
So you did not test the script inside Docker? Have you tried running it there?
Do you have any documentation of how to install Discourse without using Docker?
Is anyone actively working on the Xenforo bulk importer? I would be very interested in helping test this when it’s ready.
I tried the regular Xenforo importer on my site with about 1.5 million posts and it doesn’t appear to be working very well. First, all my posts are being thrown into the “Uncategorized” category. I was expecting each “Forum” in Xenforo to become a category in Discourse, but that’s not what happened. It’s also running extremely slowly, importing about 100 posts a minute, which by my calculations will take well over a week to import everything.
OK, I initialized the development environment and tried to run the importer as the discourse user:
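For anyone who wants to check that estimate, here is the rough calculation, assuming the rate stays constant for the whole run (which it may not). `import_days` is just a throwaway helper name.

```ruby
# Rough estimate only: assumes a constant import rate for the whole run.
def import_days(total_posts, posts_per_minute)
  minutes = total_posts.to_f / posts_per_minute
  minutes / 60 / 24
end

estimate = import_days(1_500_000, 100)
puts estimate.round(1) # → 10.4
```

So about ten and a half days at 100 posts a minute.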
$ IMPORT=1 ruby script/bulk_import/vbulletin.rb
Loading application...
Starting...
Preloading I18n...
Fixing highest post numbers...
Loading imported group ids...
Loading imported user ids...
Loading imported category ids...
Loading imported topic ids...
Loading imported post ids...
Loading groups indexes...
Loading users indexes...
/usr/local/lib/ruby/gems/2.4.0/gems/rack-mini-profiler-0.10.5/lib/patches/db/pg.rb:90:in `async_exec': PG::UndefinedTable: ERROR: missing FROM-clause entry for table "user_emails" (ActiveRecord::StatementInvalid)
LINE 1: SELECT user_emails.email FROM "users"
^
: SELECT user_emails.email FROM "users"
from /usr/local/lib/ruby/gems/2.4.0/gems/rack-mini-profiler-0.10.5/lib/patches/db/pg.rb:90:in `async_exec'
from /usr/local/lib/ruby/gems/2.4.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/postgresql_adapter.rb:592:in `block in exec_no_cache'
from /usr/local/lib/ruby/gems/2.4.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/abstract_adapter.rb:484:in `block in log'
from /usr/local/lib/ruby/gems/2.4.0/gems/activesupport-4.2.8/lib/active_support/notifications/instrumenter.rb:20:in `instrument'
from /usr/local/lib/ruby/gems/2.4.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/abstract_adapter.rb:478:in `log'
from /usr/local/lib/ruby/gems/2.4.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/postgresql_adapter.rb:592:in `exec_no_cache'
from /usr/local/lib/ruby/gems/2.4.0/gems/activerecord-4.2.8/lib/active_record/connection_adapters/postgresql_adapter.rb:584:in `execute_and_clear'
from /var/www/discourse/lib/freedom_patches/fast_pluck.rb:41:in `select_raw'
from /var/www/discourse/lib/freedom_patches/fast_pluck.rb:67:in `pluck'
from /var/www/discourse/script/bulk_import/base.rb:96:in `load_indexes'
from /var/www/discourse/script/bulk_import/base.rb:32:in `run'
from script/bulk_import/vbulletin.rb:377:in `<main>'