Migrate a mailing list to Discourse (mbox, Listserv, Google Groups, etc)

Thank you for providing this guide and import script! I have used it successfully with a Google Group, using Google Takeout. I just put the .mbox file in the right directory and ran the script.

I did have a question about importing emails whose parent messages are not in the .mbox file. For example, many threads in our group were started by forwarding an email that wasn’t sent to the group, or by adding the group to the reply list in the middle of a conversation to loop it in.

Currently, after importing, it seems as if these previous emails are not present. You can find them if you click on the email icon and view the HTML. I was curious whether anyone else has encountered this situation and found a solution for it. I could imagine either including the previous email chain in the post, or parsing it to extract the individual messages and adding all of those.

You would need to find a way to generate those messages from the quoted text and add them to the mbox file (probably with Message-ID headers) before running the import script.
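
As a rough illustration of that idea, here is a minimal Python sketch that appends a reconstructed parent message to an mbox file and points the existing child message at it through `In-Reply-To`. It is not part of the import script. The function name `add_synthetic_parent`, the `archive.invalid` domain, and the assumption that the importer threads on `Message-ID` / `In-Reply-To` are all illustrative, and extracting the sender, date, and body from the quoted text is left out because that depends entirely on how your mail clients quoted the original.

```python
import mailbox
from email.message import EmailMessage
from email.utils import make_msgid


def add_synthetic_parent(mbox_path, child_message_id, sender, subject, date, body):
    """Append a reconstructed parent message and link the child to it.

    Hypothetical helper: the header values must come from your own
    parsing of the quoted text. Returns the generated Message-ID.
    """
    box = mailbox.mbox(mbox_path)
    box.lock()
    try:
        # Build the parent with a fresh, clearly synthetic Message-ID.
        parent = EmailMessage()
        parent_id = make_msgid(domain="archive.invalid")
        parent["Message-ID"] = parent_id
        parent["From"] = sender
        parent["Subject"] = subject
        parent["Date"] = date
        parent.set_content(body)
        box.add(parent)

        # Point the existing child message at the new parent.
        for key, msg in box.items():
            if msg["Message-ID"] == child_message_id:
                del msg["In-Reply-To"]  # no error if the header is absent
                msg["In-Reply-To"] = parent_id
                box[key] = msg
                break
        box.flush()
    finally:
        box.close()  # also releases the lock
    return parent_id
```

Work on a copy of the mbox file, and check a handful of threads after a test import before running this over the whole archive.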

This is really excellent. But I have some issues with some emails coming into Discourse with an initial email and then the mbox format replies in the same post, not formatted. I’m not sure what is causing this.

The question is, how can I delete all the imported mails (20 years’ worth) without deleting and recreating the target Discourse instance?

I’m aware the recommended RAM requirement is 8GB. I did try importing 20 years of posts on a 2GB virtual machine and it ran for a while and crashed with the message ‘killed’. 8GB machines on hosting providers such as DigitalOcean are expensive (for me). Is there any way to do this with less memory? Import in smaller batches perhaps?

Maybe delete those categories and then delete the associated topic custom fields.

No, I don’t think you can do much of an import on a small machine. You could try on a desktop but then you’ve got bandwidth issues to get the database back to the internet.

I know there is not much activity on this thread, but I can’t succeed in getting it working properly. Many of the mbox format emails I import are not split properly. The From lines look like this:

From MAILER-DAEMON Tue Nov 01 05:57:09 2022

But some messages import correctly and then, in the same post body, contain raw mbox-format messages starting with the typical From line. In other words, they are not being split. I don’t see that I need to modify the regex that does the splitting, and I don’t know Ruby, so I can’t debug the import script.

I don’t know where to go from here. There’s 20 years of messages to import, so I can’t go through the imported messages by hand to fix them up. In short this script is not working for me. Why would I be the only one this happens to?

You’re not. My first paid Discourse job was months of cleaning up old mbox files that had been hand-edited for some reason that I can’t recall.

It sounds like you do need to muck with the regex or find some other way to fix the errant messages. One way is to use some other tool to split the messages into one per file.
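
One way to do that outside the import script is sketched below in Python. It assumes the separator lines all look like the example quoted earlier in this thread (`From MAILER-DAEMON Tue Nov 01 05:57:09 2022`) and, unlike a strict mbox parser, does not require a blank line before them. The pattern and the function name `split_mbox` are illustrative and are not taken from the import script.

```python
import re

# Assumed separator shape, based on the example line quoted in the thread:
#   From MAILER-DAEMON Tue Nov 01 05:57:09 2022
FROM_LINE = re.compile(
    r"^From \S+ +(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun) "
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d{1,2} "
    r"\d{2}:\d{2}:\d{2} \d{4}\s*$"
)


def split_mbox(text):
    """Split mbox text into a list of strings, one per message.

    A new message starts at every line matching FROM_LINE, whether or
    not a blank line precedes it. Anything before the first separator
    ends up in the first chunk.
    """
    messages = []
    current = []
    for line in text.splitlines(keepends=True):
        if FROM_LINE.match(line) and current:
            messages.append("".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("".join(current))
    return messages
```

Each returned chunk can then be written to its own file. Old archives often contain bytes that are not valid UTF-8, so in practice you may need to read the file with an explicit encoding and error handler, and the pattern will need loosening if your separator lines vary over the years.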

FWIW, I wrote several import scripts before I knew Ruby.

Every import is unique. With 20 years of data, it’s a good bet that you’ll have several different issues as things changed in the various systems that were used.

You bet. That’s for sure.

I want to import 20 years of messages from my mailman2 system into an archive directory, but I don’t want to create user IDs (not even staged ones) for them, as many of our subscribers have moved on or passed on and it would create many accounts that will just take up space.

Can I import them all under the same user ID (perhaps ‘archive’)?

And this may be a dumb question, but since the app is turned off during the import process, does that mean users who have signed up for emails about new posts won’t get flooded with emails about all the archives that were just loaded?

You can comment out the import_users function and all messages will be owned by system.

You’re not going to save much space.

No users will receive email until they have used the forgot password process to log in to their account. If you’re importing this data into an existing community, then I believe that users will get notifications about the new messages that are created by the import script.

Thanks, I was looking through the import script and figured that I might be able to just disable the new user section. Testing that is on my list.

It isn’t file space I’m thinking about, it’s having possibly hundreds of staged user accounts that will never be used, so it’s more like head space or a very long user list.

You know your users, but having accounts that no one will use seems much better than not knowing who posted 20 years’ worth of messages.

That’s a valid point, Jay.

I’m not finding the import_mbox.sh file and when I try executing the mbox.rb script directly, I get a bunch of Ruby errors:

root@lists-import:/var/www/discourse/script/import_scripts# ruby mbox.rb mbox
fatal: detected dubious ownership in repository at '/var/www/discourse'
To add an exception for this directory, call:

    git config --global --add safe.directory /var/www/discourse

/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/zeitwerk-2.6.7/lib/zeitwerk/loader/callbacks.rb:25:in `on_file_autoloaded': expected file /var/www/discourse/lib/freedom_patches/pluck_first.rb to define constant FreedomPatches::PluckFirst, but didn't (Zeitwerk::NameError)

  raise Zeitwerk::NameError.new(msg, cref.last)

Greetings folks. What a great guide. Thank you to Gerhard and others for contributing.

Has anyone here adapted this for Lyris? I’m interested in migrating a historic install and would like to know whether anyone hit special concerns in a similar project.

I needed to import posts from a mailing list to Discourse, and ran into two problems.

  • sqlite3 was not found.
  • I could not find import_mbox.sh

Here are my solutions:

install sqlite3

I added to Gemfile:

 gem "sqlite3", "~> 1.3", ">= 1.3.13"

then run:

cd discourse
bundle config set frozen false
bundler install

run the import

cd discourse
RAILS_ENV=production bundle exec rails runner script/import_scripts/mbox.rb script/import_scripts/mbox/settings.yml

You probably missed the following step which is hidden behind “Regular import” in 1.2. Preparing the Docker container.
