HOWTO: Import MBOX (mailing list) files

import

(Jay Pfaffman) #1

:warning: This Howto is out-of-date.

Please check out Importing mailing lists (mbox, Listserv, Google Groups, emails, ...) for updated instructions.


Want to move your mailing list archive to Discourse?

Let’s do it!

Prerequisites

  1. Not everything that writes MBOX files plays by the same rules. You will likely need to do some programming to get your import to work just right. This script should import most of your data, but unless you are very lucky, getting every bit of it will take a bit of code tweaks.

  2. If you are using a hosting service, make sure you know what version of Discourse they are using. If the Discourse version you use for the import is newer than your target platform, you will not be able to restore the database you create in this guide.

  3. Set up Discourse development environment on OS X, Ubuntu or Windows.

  4. Clear existing data from your local Discourse instance:

cd ~/discourse
rake db:drop db:create db:migrate
  1. Optional: add an admin account:
RAILS_ENV=development bundle exec rake admin:create

Preparation of your files

Are your mailbox files perfect, with no broken email addresses, no text you want to remove, and you want all those messages in the same category? Skip to the next section.

Note: Files are imported in alpha order, so if you want your topic IDs to match the date order, name your files accordingly. If you don’t care about that, neither does Discourse.

It is easiest to leave lots of messages in each file (typically one file per month), but you may choose to use formail to split those files into one message per file so that you could do clever things with grep and find to move messages into different folders to categorize them.

You can split your mbox file into one-message-per-file like this:

export FILENO=0000
mkdir split
formail -ds sh -c 'cat > split/msg.$FILENO' < mbox

If you wanted to move all of the messages with “job” in the subject into a separate directory, you might do this:

mkdir jobs
find . -name "msg.*" -exec egrep "^Subject:.*job" \{\} \; -exec mv \{\} jobs \;

Are there annoying footers or advertisements in your email messages? If you want to remove them /en masse/, you should do so now with perl, ruby, ex, awk, or sed. If you know Ruby, you might also look at clean_raw, which you can also tweak to do some replacements. Text can also be removed (or modified) by the script itself. Look at the gsubs in def clean_raw in the script for an example.

Discourse does not try to remove email signatures. If you’re foolhardy, determined, or a perfectionist, you might split them all into single files with formail and see how things go if you remove everything after the first /^-- $/ in a message. This is left as an exercise to the reader.

Configuring the import script

  • Paste the following into your shell/terminal:
export MBOX_SUBDIR="messages" # subdirectory with mbox files
export LIST_NAME=LIST_NAME
export DEFAULT_TRUST_LEVEL=1
export DATA_DIR=~/data/import
export SPLIT_AT="^From " # or "^From (.*)"

You can then use the up-arrow to return to those lines and use the arrow keys to edit them as described below.

  1. Replace MBOX_DIR value with the location of your mbox directories. An SQLite database file will be created in this location.
  • Replace MBOX_SUBDIR if it is not called “messages”. Gzipped files are OK. If you create subdirectories below messages, they can be imported into separate categories (see below).

  • If you have Subjects like Subject: [List Name] blah, and you would like [List name] removed, set LIST_NAME accordingly.

  • Users created by the script will be created with DEFAULT_TRUST_LEVEL. Set this value to whatever level you deem appropriate.

  • Check the format of your MBOX files to see what they look like. The MBOX “standard” (in as much as there is one) is that all lines that start with "From " (that’s “From” followed by a space); lines in messages (rather than the header) that begin with “From” are supposed to have a “>” inserted before them. Your mileage may vary. Set SPLIT_AT accordingly.

  • If you have organized your messages into folders/directories by categories, you will need to edit mbox.rb to map those directory names to category names in section like this:

  CATEGORY_MAPPINGS = {
    "default" => "uncategorized",
    # ex: "jobs-folder" => "jobs"
  }
  • If the email address in the From: line is in the body of a message, it will be replaced with the @username. If you would like to replace every user’s email address in every message, you can uncomment # replace_email_addresses in the execute function. It can take a long time, though; for every user, it does a database query for their email address and then does a replace against all of those hits.

Perform Import

  1. Start import process:
cd ~/discourse
bundle exec ruby script/import_scripts/mbox.rb

Or, if you would like error and warning messages to be saved for future consumption:

cd ~/discourse
bundle exec ruby script/import_scripts/mbox.rb 2>> logfile

Note that if you redirect errors to the logfile, you will not see them in your terminal.

  1. Wait until the import is done. You can restart the process if it slows down to a crawl. Before adding data to Discourse it is first read from the MBOX files and stored in an SQLite database. When you have trouble with the import, you can look there for clues. If a message cannot be imported (e.g., email address invalid) you will get notification.
  • You can speed up restarts of the script by deleting (or moving) files that have already been processed.
  1. Start your Discourse instance: bundle exec rails server
  • Start Sidekiq and let it do its work:
bundle exec sidekiq -q critical,4 -q default,2 -q low

Depending on your forum size this can take a long time. You can monitor the progress at http://localhost:3000/sidekiq

  1. Backup the data on your development machine and upload it to on your production site by following this howto.

Congratulations! You have successfully migrated your mailing list to Discourse! :tada:


Import Previous Emails
(Gunnar Helliesen) #2

I’m a newbie with Ruby and could use some help, please. I’ve followed the instructions for setting up a dev environment on Ubuntu (14.04.5 LTS Server) to the letter. But when I try start the import using the command shown above, I get an error message. Here’s the output:

gunnar@callum:~/discourse$ bundle exec ruby script/import_scripts/mbox.rb
script/import_scripts/mbox.rb:1:in `require': cannot load such file -- sqlite3 (LoadError)
	from script/import_scripts/mbox.rb:1:in `<main>'
gunnar@callum:~/discourse$ ruby -v
ruby 2.3.0p0 (2015-12-25 revision 53290) [x86_64-linux]
gunnar@callum:~/discourse$ sqlite3 -version
3.8.2 2013-12-06 14:53:30 27392118af4c38c5203a04b8013e1afdb1cebd0d
gunnar@callum:~/discourse$ irb
irb(main):001:0> require 'sqlite3'
=> true
irb(main):002:0> 
gunnar@callum:~/discourse$

Any ideas?

Thanks!
Gunnar


(Jay Pfaffman) #3

Add sqlite to the Gemfile and

bundle install

(Gunnar Helliesen) #4

Thank you! That did indeed solve that problem. I got a different error now, though:

gunnar@callum:~/discourse$ bundle exec ruby script/import_scripts/mbox.rb
loading existing groups...
loading existing users...
loading existing categories...
loading existing posts...
loading existing topics...
        1 / 1 (100.0%)  
creating indices
0 records couldn't be associated with parents
0 replies to wire up

importing users
/home/gunnar/discourse/script/import_scripts/base.rb:223:in `all_records_exist?': private method `exec' called for nil:NilClass (NoMethodError)
	from script/import_scripts/mbox.rb:336:in `block in import_users'
	from /home/gunnar/discourse/script/import_scripts/base.rb:801:in `block in batches'
	from /home/gunnar/discourse/script/import_scripts/base.rb:800:in `loop'
	from /home/gunnar/discourse/script/import_scripts/base.rb:800:in `batches'
	from script/import_scripts/mbox.rb:333:in `import_users'
	from script/import_scripts/mbox.rb:56:in `execute'
	from /home/gunnar/discourse/script/import_scripts/base.rb:45:in `perform'
	from script/import_scripts/mbox.rb:555:in `<main>'
gunnar@callum:~/discourse$

(Jay Pfaffman) #5

I think it’s because it can’t find your data.


(Gunnar Helliesen) #6

You’re right again. The sqlite database is empty, save for two empty tables, emails and users. Now to figure out why…

Thanks for all your help!

Gunnar


(Jay Pfaffman) #7

Did you do these things?

export MBOX_SUBDIR="messages" # subdirectory with mbox files
export LIST_NAME=LIST_NAME
export DEFAULT_TRUST_LEVEL=1
export DATA_DIR=~/data/import
export SPLIT_AT="^From " # or "^From (.*)"

The DATA_DIR is the one that needs fixing.

Next time I do an mbox import I’ll see about making the error checking more robust. The first time I ran it, I thought I’d never figure it out.


(Gunnar Helliesen) #8

Yes, I did define those. The problem was that I didn’t understand that MBOX_SUBDIR is relative to DATA_DIR, so I created it in the wrong place, in ~/discourse. Once I moved it inside DATA_DIR, the import ran fine.

Thanks!


(Gunnar Helliesen) #9

Is there any way of speeding up the import process? With a reasonable amount of CPU and memory, I’m getting about 27,000 messages imported per hour. With 1.7 million messages, it’ll take a month to import everything. We can’t be down for a month.

Clearly, I need to throw an unreasonable amount of resources at this.

I’m thinking a very large EC2 instance, with /var/lib/postgresql and ~/discourse mounted on memory file systems. Is there anything else I can do to speed things up? Give some process more memory to run in, or make some other process run on several cores? Ideas?

Thanks!


(Gunnar Helliesen) #10

New problem. I’ve set up folder to category mappings in mbox.rb, however, when I run the import, it complains that the first file is a directory.

From mbox.rb:

  CATEGORY_MAPPINGS = {
    "Collectibles" => "Collectibles",
    "Concours" => "Concours",
    "DaimLan" => "DaimLan",
    "E-Type" => "E-Type",
    "Jag-lovers" => "Jag-lovers",
    "Jag-News" => "Jag-News",
    "Jag-News-Discussion" => "Jag-News-Discussion",
    "Lumps" => "Lumps",
    "Modern" => "Modern",
    "Policy" => "Policy",
    "Pre-XK" => "Pre-XK",
    "Pub" => "Pub",
    "Racing" => "Racing",
    "S-Type" => "S-Type",
    "Saloons" => "Saloons",
    "V12-Engine" => "V12-Engine",
    "Webteam" => "Webteam",
    "Workers" => "Workers",
    "X-Type" => "X-Type",
    "X300" => "X300",
    "XJ" => "XJ",
    "XJ-S" => "XJ-S",
    "XJ40" => "XJ40",
    "XK" => "XK",
    "XK-Engine" => "XK-Engine",
    "XK8" => "XK8"
  }

The mbox files are organized into those subdirs under messages, but when I run the import I get this:

Processing: /home/gunnar/discourse/data/import/messages/Collectibles
script/import_scripts/mbox.rb:106:in `each_line': Is a directory @ io_fillbuf - fd:16 /home/gunnar/discourse/data/import/messages/Collectibles (Errno::EISDIR)
        from script/import_scripts/mbox.rb:106:in `each_line'
        from script/import_scripts/mbox.rb:133:in `block in all_messages'
        from script/import_scripts/mbox.rb:125:in `each'
        from script/import_scripts/mbox.rb:125:in `each_with_index'
        from script/import_scripts/mbox.rb:125:in `all_messages'
        from script/import_scripts/mbox.rb:264:in `create_email_indices'
        from script/import_scripts/mbox.rb:77:in `execute'
        from /home/gunnar/discourse/script/import_scripts/base.rb:45:in `perform'
        from script/import_scripts/mbox.rb:579:in `<main>'

Any ideas?

Thanks!


(Jay Pfaffman) #11

Edit: You already knew a better way to fix this than what I described, so I deleted that part.

I don’t know how to speed it up. The good news, though, is that you can do the import now and then do it again againsted the updated data set whenever it finishes and it’ll import only what’s missing. You can even mv the already-processed files somewhere else so it won’t even read them again (if it does read them again, it’ll skip importing them, still saving time).


(Gunnar Helliesen) #12

Actually, line 94 needs to be changed:

files << Dir["#{MBOX_DIR}/#{MBOX_SUBDIR}/#{k}/*"]

And also, there needs to be a test for “is this file a directory” in the function each_line().

  def each_line(f)
    infile = File.open(f, 'r')
    if !File.directory?(infile)
      if f.ends_with?('.gz')
        gz = Zlib::GzipReader.new(infile)
        gz.each_line do |line|
          yield line
        end
      else
        infile.each_line do |line|
          yield line
        end
      end
    end
  ensure
    infile.close
  end

I tried restarting the import per subfolder, but as far as I could tell, each restart of mbox.rb initialized the sqlite3 database to an empty state, because it would only contain the messages from the last import? At least as far as I could tell with my limited knowledge of SQL.

As for speed, firing up an EC2 instance with a lot of memory and putting everything (including the Postgresql database) on a memory file system certainly helped. Unfortunately the Ruby import script is only using one thread and one CPU core, so that’s my limitation now.


(Gunnar Helliesen) #13

Had another crash, this time I think the import was in the process of processing replies. @pfaffman, any ideas how to get past this?

/home/gunnar/.rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/rack-mini-profiler-0.10.1/lib/patches/db/pg.rb:50:in `exec': ERROR:  value too long for type character varying(200) (PG::StringDataRightTruncation)
	from /home/gunnar/.rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/rack-mini-profiler-0.10.1/lib/patches/db/pg.rb:50:in `exec'
	from /home/gunnar/discourse/script/import_scripts/base.rb:213:in `all_records_exist?'
	from script/import_scripts/mbox.rb:523:in `block in import_replies'
	from /home/gunnar/discourse/script/import_scripts/base.rb:801:in `block in batches'
	from /home/gunnar/discourse/script/import_scripts/base.rb:800:in `loop'
	from /home/gunnar/discourse/script/import_scripts/base.rb:800:in `batches'
	from script/import_scripts/mbox.rb:518:in `import_replies'
	from script/import_scripts/mbox.rb:82:in `execute'
	from /home/gunnar/discourse/script/import_scripts/base.rb:45:in `perform'
	from script/import_scripts/mbox.rb:581:in `<main>'

Thanks!


(Jay Pfaffman) #14

Nice work. I was relieved that you knew something about programming. I was looking for a 2-minute solution that I could possibly explain to someone who didn’t know how to program. This is obviously a much better solution than what I was recommending.

I’m not entirely sure, but

ERROR:  value too long for type character varying(200) (PG::StringDataRightTruncation)

suggests that the data going into the table is too long. I can’t remember what’s in that key, but my best guess is that you have a really freaking long subject, or one that has lots of characters that need to be escaped, making the string really long. Could that be the case?

Oh. Here’s an idea. Change

to

    conn.exec('CREATE TEMP TABLE import_ids(val varchar(400) PRIMARY KEY)')

You could still get into trouble down the line if subjects (aka topic subjects) are too long, but it’s likely that the code that inserts those will do the Right Thing.


(Gunnar Helliesen) #15

Gotcha. Will try ASAP and let you know how it goes. Thanks for all your help, you’re awesome!


(Jay Pfaffman) #16

When you get it figured out, please submit it as a PR. If you don’t want to fool with submitting a PR, then send me the code and I’ll submit it.


(Nick Johannessen) #17

I’m working with Gunnar on this project and appreciate his frustration in having to start the import from scratch every time a problem halts the import. Is it possible to change the SQLite3 code to check for an existing database instead of creating a fresh one on each run? I believe the relevant code is this one:

def open_db
   SQLite3::Database.new("#{MBOX_DIR}/index.db")
end

There may of course be excellent reasons for not allowing the import to continue, in which case, ignore my comment :slight_smile:


(Jay Pfaffman) #18

When I used the script it did not create the database every time. You have to delete the database by hand if you wanted to clear it. That’s the way all of the import scripts work.

I can have a look at it later today.


(Gunnar Helliesen) #19

I tried using

SQLite3::Database.open("#{MBOX_DIR}/index.db")

instead, but as far as I could tell, only the messages from the last import were preserved. I tested by opening index.db in SQLite3 and looking at the contents of the emails table. Still not sure about how the interplay works between mbox.rb, SQLite3 and Postgresql, though.


(Jay Pfaffman) #20

Well, I swear that when I used this script it re-used the database on each run and I had to delete the index.db file by hand if I wanted to delete messages imported already, but I don’t see how that’s the case looking at the code. Not only does open_db create a new database, but create_email_indices drops the emails table if it exists already. It must have been some other importer that worked that way.

Here’s what happens: create_email_indices and create_user_indices, build that SQLite database from the Mbox files. When they’re all moved from MBOX to SQLite, then import_users, create_forum_topics, and import_replies pull the info from SQLite and stick it into Discourse.

When it sticks stuff into SQLite, it does an INSERT OR IGNORE, so it won’t stick multiple copies into SQLite.

Here’s what you might do:

  • fix open_db such that it’ll use an existing database.
  • fix create_email_indices and create_user_indices such that they don’t drop the table (I might move the creation of those tables into open_db)

If SQLite is smart enough, you might be able to run multiple processes that build the SQLite database (here is what SQLite has to say about concurrent writes). That would let you speed up the MBOX-to-SQLite part of the process.