Importing / migrating mailing lists (mbox, Listserv, Google Groups, emails, ...)

This guide is for you if you want to migrate a mailing list to Discourse.
It also contains instructions for importing messages from Google Groups.

1. Importing using Docker container

This is the recommended way for importing content from your mailing lists into Discourse.

1.1. Installing Discourse

:bulb: The import script most likely won’t work on systems with less than 4GB of RAM. 8GB of RAM or more is recommended. You can scale back the RAM after the import if you like.
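
If you are unsure how much memory the machine has, a small check like the one below can help. It is only a sketch: the helper name mem_total_gib is made up for this example, and it assumes a Linux host where /proc/meminfo reports MemTotal in kB. Note that a machine sold as "4GB" usually reports slightly less than 4 GiB here.

```ruby
# Hypothetical helper (not part of Discourse): parses the text of
# /proc/meminfo and returns the total memory in GiB, or nil when the
# MemTotal line is missing.
def mem_total_gib(meminfo_text)
  match = meminfo_text.match(/^MemTotal:\s+(\d+)\s+kB/)
  return nil unless match

  match[1].to_i / (1024.0 * 1024.0)
end

# Only attempt the check where /proc/meminfo exists (Linux).
if File.readable?("/proc/meminfo")
  gib = mem_total_gib(File.read("/proc/meminfo"))
  puts format("Total RAM: %.1f GiB", gib) if gib
end
```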

Install Discourse by following the official installation guide. Afterwards it’s a good idea to go to the Admin section and configure a few settings:

  • Enable login_required if imported topics shouldn’t be visible to the public.

  • Enable hide_user_profiles_from_public if user profiles shouldn’t be visible to the public.

  • Disable download_remote_images_to_local if you don’t want Discourse to download images embedded in posts.

  • Enable disable_edit_notifications if you enabled download_remote_images_to_local and don’t want your users to get lots of notifications about posts edited by the system user.

  • Change the value of slug_generation_method if most of the topic titles use characters which shouldn’t be mapped to ASCII (e.g. Arabic). See this post for more information.

:bangbang: The following steps assume that you installed Discourse on Ubuntu and that you are connected to the machine via SSH or have direct access to the machine’s terminal.

1.2. Preparing the Docker container

Copy the container configuration file app.yml to import.yml and edit it with your favorite editor.

cd /var/discourse
cp containers/app.yml containers/import.yml
nano containers/import.yml

Regular import

Add - "templates/import/mbox.template.yml" to the list of templates. Afterwards it should look something like this:

templates:
  - "templates/postgres.template.yml"
  - "templates/redis.template.yml"
  - "templates/web.template.yml"
  - "templates/web.ratelimited.template.yml"
## Uncomment these two lines if you wish to add Lets Encrypt (https)
  #- "templates/web.ssl.template.yml"
  #- "templates/web.letsencrypt.ssl.template.yml"
  - "templates/import/mbox.template.yml"

That’s it. You can save the file, close the editor and build the container.
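
Before building, it can be worth double-checking that the template entry really ended up in the list. The following is a minimal sketch, not an official tool: the helper name missing_templates is hypothetical, and it assumes the file has a top-level templates: list like the one shown above and parses with Ruby's standard YAML library.

```ruby
require "yaml"

# Hypothetical helper: returns the entries from `required` that are not
# listed under the top-level `templates:` key of the given YAML text.
def missing_templates(config_text, required)
  config = YAML.safe_load(config_text)
  config = {} unless config.is_a?(Hash)
  required - Array(config["templates"])
end

sample = <<~YAML
  templates:
    - "templates/postgres.template.yml"
    - "templates/redis.template.yml"
    - "templates/web.template.yml"
    - "templates/import/mbox.template.yml"
YAML

# An empty list means nothing is missing.
p missing_templates(sample, ["templates/import/mbox.template.yml"])
```

To check the real file, pass File.read("/var/discourse/containers/import.yml") instead of the sample text. For a Google Groups import, also include "templates/import/chrome-dep.template.yml" in the required list.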

Google Groups import

You need to add two entries to the list of templates:

  - "templates/import/chrome-dep.template.yml"
  - "templates/import/mbox.template.yml"

Afterwards it should look something like this:

templates:
  - "templates/postgres.template.yml"
  - "templates/redis.template.yml"
  - "templates/web.template.yml"
  - "templates/web.ratelimited.template.yml"
## Uncomment these two lines if you wish to add Lets Encrypt (https)
  #- "templates/web.ssl.template.yml"
  #- "templates/web.letsencrypt.ssl.template.yml"
  - "templates/import/chrome-dep.template.yml"
  - "templates/import/mbox.template.yml"

That’s it. You can save the file, close the editor and build the container.

/var/discourse/launcher stop app
/var/discourse/launcher rebuild import

Building the container creates an import directory within the container’s shared directory. It looks like this:

/var/discourse/shared/standalone/import
├── data
└── settings.yml

1.3. Downloading messages from Google Groups (optional)

You can skip this step unless you want to migrate from Google Groups.

Instructions for Google Groups

1.3.1. Preparation

:warning: Make sure you don’t have any pinned posts in your group, otherwise the crawler might fail to download some or all messages.

:warning: Make sure the group settings allow posting, otherwise you might see “Failed to scrape message” error messages. It might take a couple of minutes before the scraping works when you changed those settings recently.

Google account: You need a Google account that has the Manager or Owner role for your Google Group, otherwise the downloaded messages will contain censored email addresses.

Group name: You can find the group name by visiting your Google Group and looking at the browser’s address bar.

Domain name: The URL might look a little bit different if you are a G Suite customer. You need to know the domain name if the URL contains something like example.com.

1.3.2. Cookies :cookie:

In order to download messages, the crawler needs access to a Google account that has the Manager or Owner role for your group (see step 1.3.1). Please visit https://myaccount.google.com/ in your browser and sign in if you aren’t already logged in. Then use a browser extension of your choice to export your cookies for google.com to a file named cookies.txt.

The recommended browser extension is Export Cookies for Mozilla Firefox.

Upload the cookies.txt file to your server and save it within the /var/discourse/shared/standalone/import directory.
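
If the download later fails with a login error, a first sanity check is whether the exported file contains any google.com cookies at all. The sketch below assumes the export uses the common Netscape cookies.txt layout (seven tab-separated fields per line). The helper name google_cookie_names and the cookie names in the sample are made up for illustration, and the sketch does not know which cookies the crawler actually needs.

```ruby
# Hypothetical helper: returns the names of cookies whose domain is
# google.com or one of its subdomains. Assumes the Netscape layout:
# domain, subdomain flag, path, secure flag, expiry, name, value.
def google_cookie_names(cookies_text)
  cookies_text.each_line.map do |line|
    # Some exporters mark HttpOnly cookies with this prefix.
    line = line.chomp.sub(/\A#HttpOnly_/, "")
    next if line.empty? || line.start_with?("#")

    fields = line.split("\t", -1)
    next unless fields.size == 7

    domain = fields[0].sub(/\A\./, "")
    fields[5] if domain == "google.com" || domain.end_with?(".google.com")
  end.compact
end

# Placeholder data, not real cookies.
sample = [
  "# Netscape HTTP Cookie File",
  ".google.com\tTRUE\t/\tTRUE\t1700000000\tEXAMPLE_A\tvalue-a",
  "#HttpOnly_accounts.google.com\tFALSE\t/\tTRUE\t1700000000\tEXAMPLE_B\tvalue-b",
  ".example.org\tTRUE\t/\tFALSE\t1700000000\tOTHER\tvalue-c"
].join("\n")

p google_cookie_names(sample)
```

An empty result for your own cookies.txt would suggest that the export did not include the google.com cookies or uses a different file format.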

1.3.3. Download messages

:bulb: Tip: It’s a good idea to download messages inside a tmux or screen session, so that you can reconnect to the session in case of SSH connection loss.

Let’s start by entering the Docker container.

/var/discourse/launcher enter import

Replace the <group_name> (and if applicable, the <domain_name>) placeholders within the following command with the group name and domain name from step 1.3.1 and execute it inside the Docker container in order to start the download of messages.

If you didn’t find a domain name in step 1.3.1, this is the command for you:

script/import_scripts/google_groups.rb -g <group_name>

Or, if you found a domain name in step 1.3.1, use this command instead:

script/import_scripts/google_groups.rb -g <group_name> -d <domain_name>

Downloading all messages can take a long time. It mostly depends on the number of topics in your Google Group. The script will show you a message like this when it’s finished: Done (00h 26min 52sec)

:bulb: Tip: You can abort the download anytime you want by pressing Ctrl+C
When you restart the download it will continue where it left off.

1.4. Configuring the importer

You can configure the importer by editing the example settings.yml file that has been copied into the import directory.

nano /var/discourse/shared/standalone/import/settings.yml

The settings file comes with sensible defaults, but here are a few tips anyway:

  • The settings file contains multiple examples on how to split data files:

    • mbox files usually are separated by a From header. Choose a regular expression that works for your files.

    • If each of your files contains only one message, set the split_regex to an empty string. This also applies to imports from Google Groups.

    • There’s also an example for files from the popular Listserv mailing list software.

  • prefer_html allows you to configure if the import should use the HTML part of emails when it exists. You should choose what suits you best – it heavily depends on the emails sent to your mailing list.

  • By default each user imported from the mailing list is created as a staged user. You can disable that behaviour by setting staged to false.

  • If your emails do not contain a Message-ID header (like messages stored by Listserv), you should enable the group_messages_by_subject setting.
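
To get a feeling for whether a split_regex fits your files, you can count how often it matches in one of them. This is only a rough sketch: the helper name count_split_points is made up, the sample messages are invented, and how the importer itself applies the regular expression is not shown here. The pattern is the "^From .+@.+" value quoted further down in this topic.

```ruby
# Hypothetical helper: counts how many places in `text` match the
# candidate split pattern. In Ruby, ^ matches at the start of every line.
def count_split_points(text, pattern)
  text.scan(Regexp.new(pattern)).size
end

sample_mbox = <<~MBOX
  From alice@example.com Mon Jan  1 10:00:00 2018
  Message-ID: <1@example.com>
  Subject: First

  Hello

  From bob@example.com Mon Jan  1 11:00:00 2018
  Message-ID: <2@example.com>
  Subject: Re: First

  Hi
MBOX

puts count_split_points(sample_mbox, "^From .+@.+") # prints 2
```

If the count is far from the number of messages you expect in a file, the pattern probably needs adjusting.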

1.5. Prepare files

Each subdirectory of /var/discourse/shared/standalone/import/data gets imported as its own category and each directory should contain the data files you want to import. The file names of those do not matter.

Example: The import directory should look like this if you want to import two mailing lists with multiple mbox files:

/var/discourse/shared/standalone/import
├── data
│   ├── list 1
│   │   ├── foo
│   │   └── bar
│   └── list 2
│       ├── 2017-12.mbox
│       └── 2018-01.mbox
└── settings.yml
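
A quick pre-flight check of the layout could look like the sketch below. The helper name files_per_category is hypothetical. It simply lists every subdirectory of the data directory (one future category each) together with the number of files directly inside it, and ignores plain files such as index.db.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: maps each subdirectory of data_dir to the number
# of regular files directly inside it.
def files_per_category(data_dir)
  Dir.children(data_dir).sort.each_with_object({}) do |name, counts|
    path = File.join(data_dir, name)
    next unless File.directory?(path)

    counts[name] = Dir.children(path).count { |f| File.file?(File.join(path, f)) }
  end
end

# Demo in a temporary directory that mirrors the example layout above.
Dir.mktmpdir do |root|
  data = File.join(root, "data")
  FileUtils.mkdir_p(File.join(data, "list 1"))
  FileUtils.mkdir_p(File.join(data, "list 2"))
  FileUtils.touch(File.join(data, "list 1", "foo"))
  FileUtils.touch(File.join(data, "list 1", "bar"))
  FileUtils.touch(File.join(data, "list 2", "2017-12.mbox"))
  FileUtils.touch(File.join(data, "list 2", "2018-01.mbox"))
  p files_per_category(data)
end
```

On the server you would call files_per_category("/var/discourse/shared/standalone/import/data") instead. A directory with a count of 0 would become a category without any messages.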

1.6. Executing the import script

:bulb: Tip: It’s a good idea to start the import inside a tmux or screen session, so that you can reconnect to the session in case of SSH connection loss.

Let’s start the import by entering the Docker container and launching the import script inside it.

/var/discourse/launcher enter import
import_mbox.sh # inside the Docker container

Depending on the size of your mailing lists it’s now time for some :coffee: or :sleeping:
The import script will show you a message like this when it’s finished: Done (00h 26min 52sec)

:bulb: Tip: You can abort the import anytime you want by pressing Ctrl+C
When you restart the import it will continue where it left off.

You can exit and stop the Docker container after the import has finished.

exit # inside the Docker container
/var/discourse/launcher stop import

1.7. Starting Discourse

Let’s start the app container and take a look at the imported data.

/var/discourse/launcher start app

Discourse will start and Sidekiq will begin post-processing all the imported posts. This can take a considerable amount of time. You can watch the progress by logging in as admin and visiting http://discourse.example.com/sidekiq

1.8. Clean up

So, you are satisfied with the result of the import and want to free some disk space? The following commands will delete the Docker container used for importing as well as all the files used during the import.

/var/discourse/launcher destroy import
rm /var/discourse/containers/import.yml
rm -R /var/discourse/shared/standalone/import

1.9. The End

Now it’s time to celebrate and enjoy your new Discourse instance! :tada:

2. FAQ

2.1. How can I remove list names (e.g. [Foo]) from topic titles during the import?

You can use an empty tag to remove one or more prefixes from topic titles. The settings file contains an example.
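
To illustrate the effect on a title (not the importer's actual code or settings syntax), here is a small sketch. The helper name strip_list_prefixes and the sample title are made up.

```ruby
# Hypothetical helper: removes every "[prefix]" marker for the given
# prefixes from a title and tidies up the leftover whitespace.
def strip_list_prefixes(title, prefixes)
  pattern = Regexp.union(prefixes.map { |prefix| "[#{prefix}]" })
  title.gsub(pattern, "").squeeze(" ").strip
end

puts strip_list_prefixes("[Foo] Re: [Foo] Meeting notes", ["Foo"])
# prints "Re: Meeting notes"
```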

2.2 How can I prevent the import script from detecting messages as already being imported?

:warning: The following steps will reset your Discourse forum to the initial state! You will need to start from scratch.

The following commands will stop the container, delete everything except the mbox files and the importer configuration and restart the container.

Commands
cd /var/discourse

./launcher stop app
./launcher stop import

shopt -s extglob # enables the !(pattern) glob used in the next line (bash)
rm -r ./shared/standalone/!(import)
rm ./shared/standalone/import/data/index.db

./launcher rebuild import

./launcher enter import
import_mbox.sh # inside the Docker container

2.3 How can I manipulate messages before they are imported into Discourse?

Enable index_only in settings.yml and take a look at the index.db (a SQLite database) before you run the actual import.

You can use SQL to update missing values in the database if you want. That way you don’t need to reindex any messages. The script uses only data from the index.db during the import phase. Simply disable the index_only option when you are done and rerun the importer. It will skip the indexing if none of the mbox files were changed, recalculate the content of the user and email_order tables and start the actual import process.

2.4 How can I find messages which cause problems during the import?

You can split mbox files into individual files to make it easier to find offending emails.

Commands
apt install procmail;
mkdir -p split; # the individual messages are written into this directory
export FILENO=0000;
formail -ds sh -c 'cat > split/msg.$FILENO' < mbox;
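
If installing procmail isn't an option, roughly the same splitting can be sketched in Ruby. This is not what formail does internally. It is a simplified stand-in that starts a new message at every line beginning with "From ", and the helper names split_mbox and write_messages are made up for this example.

```ruby
require "tmpdir"

# Hypothetical helper: splits mbox text into messages, starting a new
# message at every line that begins with "From ".
def split_mbox(text)
  text.lines.slice_before { |line| line.start_with?("From ") }.map(&:join)
end

# Hypothetical helper: writes each message to out_dir/msg.0000,
# out_dir/msg.0001, ... and returns the paths that were written.
def write_messages(messages, out_dir)
  messages.each_with_index.map do |message, index|
    path = File.join(out_dir, format("msg.%04d", index))
    File.write(path, message)
    path
  end
end

# Demo with two invented messages in a temporary directory.
sample = "From a@example.com Mon Jan  1 10:00:00 2018\nSubject: A\n\nHello\n" \
         "From b@example.com Mon Jan  1 11:00:00 2018\nSubject: B\n\nHi\n"

Dir.mktmpdir do |dir|
  paths = write_messages(split_mbox(sample), dir)
  puts paths.map { |path| File.basename(path) }
end
```

Body lines that start with "From " and were not escaped when the archive was created would be treated as the start of a new message by this sketch, so treat the result as a debugging aid only.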

2.5 I have already imported a group. How can I import another group?

Create a new directory in the import/data directory and restart the import script.

2.6 I don’t have access to Mailman archives in mbox format. Is there any other way to get them?

You could give this script a try.

@gerhard - I was able to migrate an mbox archive of 22,000 messages using this script on a Digital Ocean droplet with only 1GB RAM. No problems. Thank you for the write-up of instructions. Everything worked great. The only mistake I made on my first attempt was trying to name the /var/discourse/shared/standalone/import/data/X subfolder using a new category I created before running the script. That caused the import to place these messages into the Uncategorized category. On second attempt, I deleted the new category and tried again. This created the category name for me and placed the messages into the proper category automatically.

Thanks for this guide.

I’m attempting to do a Google Groups import. Unfortunately, I run into this error when running import_mbox.sh:

The mbox import is starting...

Traceback (most recent call last):
5: from script/import_scripts/mbox.rb:9:in `<main>'
4: from script/import_scripts/mbox.rb:10:in `<module:ImportScripts>'
3: from script/import_scripts/mbox.rb:13:in `<module:Mbox>'
2: from /var/www/discourse/script/import_scripts/mbox/support/settings.rb:9:in `load'
1: from /var/www/discourse/script/import_scripts/mbox/support/settings.rb:9:in `new'

/var/www/discourse/script/import_scripts/mbox/support/settings.rb:42:in `initialize': undefined method `each' for nil:NilClass (NoMethodError)

All files in /var/discourse/shared/standalone/import/data/Foo are .eml files though, not mbox. Does that matter?

Thanks!

The latest version of the import script fixes that problem. As an alternative, please update your settings file. There were some recent changes.

Thanks a lot. Could you please give some advice on how to update the import script?

Is it enough to just update the import scripts or do I have to redo more steps of the guide (which ones?)? I can’t find them and therefore don’t know how to update them.

I did update the settings file as you mentioned this being an alternative, but I’m getting the same issues.

Thanks.

You can run /var/discourse/launcher rebuild import to update the import script and everything else related to it.

Thanks.

While running import_mbox.sh almost all messages are skipped with messages like the following:

script/import_scripts/mbox.rb:12:in `<module:Mbox>'
script/import_scripts/mbox.rb:10:in `<module:ImportScripts>'
script/import_scripts/mbox.rb:9:in `<main>'
41 / 215 ( 19.1%) [59096 items/min] Failed to map post for 36a37072-e5b6-4009-878f-f0824e40eac6@googlegroups.com
undefined method `each' for nil:NilClass
/var/www/discourse/script/import_scripts/mbox/importer.rb:179:in `block in remove_tags!'
/var/www/discourse/script/import_scripts/mbox/importer.rb:176:in `loop'
/var/www/discourse/script/import_scripts/mbox/importer.rb:176:in `remove_tags!'
/var/www/discourse/script/import_scripts/mbox/importer.rb:150:in `map_first_post'
/var/www/discourse/script/import_scripts/mbox/importer.rb:104:in `block (2 levels) in import_posts'
/var/www/discourse/script/import_scripts/base.rb:503:in `block in create_posts'
/var/www/discourse/script/import_scripts/base.rb:502:in `each'
/var/www/discourse/script/import_scripts/base.rb:502:in `create_posts'
/var/www/discourse/script/import_scripts/mbox/importer.rb:98:in `block in import_posts'
/var/www/discourse/script/import_scripts/base.rb:882:in `block in batches'
/var/www/discourse/script/import_scripts/base.rb:881:in `loop'
/var/www/discourse/script/import_scripts/base.rb:881:in `batches'
/var/www/discourse/script/import_scripts/mbox/importer.rb:84:in `batches'
/var/www/discourse/script/import_scripts/mbox/importer.rb:92:in `import_posts'
/var/www/discourse/script/import_scripts/mbox/importer.rb:36:in `execute'
/var/www/discourse/script/import_scripts/base.rb:47:in `perform'

And further down:

60 / 215 ( 27.9%) [58321 items/min] Parent message 1b46f337-95a3-4b4a-a14a-689636941580@googlegroups.com doesn't exist. Skipping 5634208e-e6df-4bd8-b361-0735f73fe554@googlegroups.com:

What could be the reason for this? Thanks.

The problem should be fixed. Please rebuild your import container one more time.

Sweet, it worked like a charm. :pray: Thanks so much for your support.

I’m trying to download Google Groups and am getting

Failed to login. Please check the content of your cookies.txt

I used the recommended Firefox extension to download the cookies. Once yesterday and again today. I’ve confirmed that it’s reading the file by renaming it to something wrong and getting a “not found” error. I downloaded all the cookies, not just google ones. I logged out and back in and downloaded the cookies again.

I can see that I’m a manager because I have the “manage group” options.

I’ve triple-rechecked that I’m using the right group name by copy-pasting and seeing that it’s a group name format and not a domain name one.

Is something broken or is it just me?

@gerhard, sorry for the call-out, but have you a quick suggestion on how to debug this? Maybe a login endpoint has changed?

EDIT: Found it. I’ll submit a PR shortly. The endpoint for login changed and I managed to guess the new one. :slight_smile:

https://github.com/discourse/discourse/pull/9432

Newbie trying to import mbox files from Yahoo groups. I’ve followed these instructions several times but always with the same error message. I see others have been successful so it is likely a newbie mistake. The error appears to indicate that split_regex: "^From .+@.+" is not finding the email key to split the file but I tested the regex in a text editor and it works as expected. Line 2 of the import file is similar to Message-ID: <35690.0.1.959300741@eGroups.com>
Any ideas? TIA…

The mbox import is starting...

Traceback (most recent call last):
	12: from script/import_scripts/mbox.rb:9:in `<main>'
	11: from script/import_scripts/mbox.rb:10:in `<module:ImportScripts>'
	10: from script/import_scripts/mbox.rb:12:in `<module:Mbox>'
	 9: from script/import_scripts/mbox.rb:12:in `new'
	 8: from /var/www/discourse/script/import_scripts/mbox/importer.rb:11:in `initialize'
	 7: from /var/www/discourse/script/import_scripts/mbox/support/settings.rb:8:in `load'
	 6: from /usr/local/lib/ruby/2.6.0/psych.rb:577:in `load_file'
	 5: from /usr/local/lib/ruby/2.6.0/psych.rb:577:in `open'
	 4: from /usr/local/lib/ruby/2.6.0/psych.rb:578:in `block in load_file'
	 3: from /usr/local/lib/ruby/2.6.0/psych.rb:277:in `load'
	 2: from /usr/local/lib/ruby/2.6.0/psych.rb:390:in `parse'
	 1: from /usr/local/lib/ruby/2.6.0/psych.rb:456:in `parse_stream'
/usr/local/lib/ruby/2.6.0/psych.rb:456:in `parse': (/shared/import/settings.yml): did not find expected key while parsing a block mapping at line 2 column 1 (Psych::SyntaxError)

Looks like you made an error in the settings.yml file. I suggest you validate the configuration at http://www.yamllint.com/
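
If you'd rather check the file locally, Ruby's bundled YAML parser (Psych, the same library that appears in the traceback above) can report syntax errors. A minimal sketch, with the made-up helper name yaml_syntax_error:

```ruby
require "yaml"

# Hypothetical helper: returns nil when the text parses as YAML,
# otherwise a short description of the first syntax error.
def yaml_syntax_error(text)
  Psych.parse(text)
  nil
rescue Psych::SyntaxError => e
  "line #{e.line}, column #{e.column}: #{e.problem}"
end

p yaml_syntax_error("split_regex: \"^From .+@.+\"\n") # a valid line
p yaml_syntax_error("split_regex: [unclosed\n")       # a broken line
```

For the real file, pass File.read("/var/discourse/shared/standalone/import/settings.yml"). This only checks the syntax, not whether the settings make sense to the importer.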

Thanks @gerhard Sigh…I should have seen that issue, my first bout with Ruby. Now, I think I’m a little closer but a different error (see below). Since the import script is now loading Groups, etc., I assume the new error is past the initial problem. I also assume the referenced db file is import/index.db created by the import script (not created).

The mbox import is starting...

Loading existing groups...
Loading existing users...
Loading existing categories...
Loading existing posts...
Loading existing topics...
Traceback (most recent call last):
	9: from script/import_scripts/mbox.rb:9:in `<main>'
	8: from script/import_scripts/mbox.rb:10:in `<module:ImportScripts>'
	7: from script/import_scripts/mbox.rb:12:in `<module:Mbox>'
	6: from script/import_scripts/mbox.rb:12:in `new'
	5: from /var/www/discourse/script/import_scripts/mbox/importer.rb:14:in `initialize'
	4: from /var/www/discourse/script/import_scripts/mbox/importer.rb:14:in `new'
	3: from /var/www/discourse/script/import_scripts/mbox/support/database.rb:10:in `initialize'
	2: from /var/www/discourse/script/import_scripts/mbox/support/database.rb:10:in `new'
	1: from /var/www/discourse/vendor/bundle/ruby/2.6.0/gems/sqlite3-1.4.2/lib/sqlite3/database.rb:89:in `initialize'
/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/sqlite3-1.4.2/lib/sqlite3/database.rb:89:in `open_v2': unable to open database file (SQLite3::CantOpenException)

SYSTEM won’t allow me to edit my comment so I am submitting this reply instead.

EDIT: To close the loop…My Yahoo Group import is now working, at least to the point of indexing 9951 emails. I have not yet finished the full import so more to come. I have edited settings.yml many times and am now back to the original which suddenly seems to work! without the syntax error. I don’t understand why I have had numerous error messages that appear inconsistent to me. The original syntax error in the settings.yml is again a mystery. The above error message makes no sense to me…sigh.

@gerhard. I think I have found a way easier method of doing exactly the same as your guide, but with no technical knowledge required nor need for admin access to any server. Let me know what you think.

Overview

We’ll be essentially configuring a mailinglist and then using an email archive to send past conversations in order. Those emails will be forwarded, but not like the “Forward” button on email clients (that would override the headers and mess up the indentation). What we want to do is to remail them (send as they had been sent to discourse in the first place).

Requirements and Assumptions

  • Access to the previous email exchanges: someone who has stored it all on their email client and can volunteer to forward it – let’s call that person John Doe.

  • Time: the email forwarding will be very slow so that Discourse can keep up (perhaps a few days with a computer running and uploading the emails – depending on the archive size)

  • Thunderbird client: We also assume here John Doe uses the email client “thunderbird”. It may be possible to do this with other clients but I haven’t looked.

The following guide uses two email addresses as placeholders. You need to replace them with your actual addresses.

:incoming_envelope: johndoe@example.com John Doe’s email (the person who will forward the full mailing list archive)

:postbox: discourse+mailinglist-3@discoursemail.com Discourse email for forwarding emails to the category of the mailing list (see step 1 of the instructions for how to get it)

Instructions

Here’s a basic rundown of the instructions:

  1. follow the guide on Creating a read-only mailing list mirror to create a mirror of your mailing list

    Note: this will only mirror your mailinglist going forward. You’ll still miss out on past conversations. That’s what the rest of this guide is for.

  2. Change the way discourse forwards emails to (I’m not actually sure this is needed)
    forwarded_behavior

  3. Edit the category’s settings and under the setting Custom incoming email address: add at the end of what’s there |johndoe@example.com.

    The pipe here works like a , as to say that you also want johndoe@example.com to be able to send to that category

  4. John Doe installs on thunderbird the extension Mail Redirect.

    This is because it’s no regular email forward. What this will do is send the email as if it had gone to the discourse’s email address in the first place instead of John Doe’s

  5. John Doe goes to the extension’s settings and sets the number of SMTP connections to 1 (default is 5).

    This will make sure the replies arrive in order: otherwise discourse isn’t quick enough to realise that the replies are chained and just creates a new topic for every reply – but it will make the forwarding process very slow

  6. John Doe selects all of the mailinglist’s past emails, right-clicks and clicks on Redirect. A new window will open, where he adds discourse+mailinglist-3@discoursemail.com as the Resend-to address.

After this John Doe’s email client will be slowly sending the email archives to discourse. Just check after some time to see if the discourse category is getting filled with some nostalgically old conversations.

Cleanup

  • Remove John Doe’s email from that category’s Custom incoming email address: setting (and don’t forget to remove the |)

  • Uninstall the Mail Redirect extension – you’ll likely not need it again – or at the very least increase the SMTP connections back to 5.

We are trying to migrate our Mailman lists into an already running discourse instance. There are several private lists included for which we need permissions set for the corresponding category. When creating those categories before the import, all the posts for the private lists are added to “Uncategorized” (so automatically public).

So we have two alternative questions:

  • Is there a way to set permissions for the imported mailing lists (if they would be only admin-visible, it would already be sufficient for us) before import?
  • Is there a way to add the mailing list to an existing category (with preset permissions)?

My discourse is the continuation of a Yahoo group, which itself was a continuation of an AOL listserv. Last fall, in the face of the great Yahoo purge, I was able to download a .mbox archive of the Yahoo group, and import those messages following these instructions. I’ve now gotten a partial archive of the AOL listserv, and I’d like to import those messages as well.

Easy enough, right? Just make import/data/foo, put the messages there, and run the import script. But what I’m wondering about is if I later manage to get a complete (or a more-complete) archive. Can I just put those files into import/data/foo, run the import script again, and have it add the new messages to the same category?

  • Would it de-dupe? Or would I see multiple copies of messages that appeared in both archives?
    • Would it change the answer to this question if one, the other, or both of the archives lacked message-id headers?
  • Would a new import in the same category overwrite existing messages?
  • Most of my users are in mailing-list mode. If I don’t want to spam them with hundreds (or thousands) of notifications, not to mention run up an expensive Mailgun bill, I assume I’ll want to disable email site-wide while the import is going on?

Unfortunately that’s not possible.

Yes, you can trick the import script into reusing existing categories.

./launcher enter app
rails c

# Use the category ID shown in the URL, for example
# it's 56 when the category's path looks like this: /c/howto/devs/56
category = Category.find(56)

# Use the directory name where the mbox files are stored. For example,
# when the files are stored in import/data/foo, you should use "foo" as directory name.
category.custom_fields["import_id"] = "directory_name"
category.save!

That’s unexpected. I’ve never seen that happen, but I’ve never tried to import into existing categories with permissions other than the default permissions.

If you can’t get it to work I’d suggest you post an announcement on your forum, switch your site into read-only mode, create a backup, restore the backup on a different server, run the import, configure the category permissions, create another backup and restore it on your production site.

Yes, you can. You might want to keep the import/data/index.db file around, just in case you want to have a look at the previously imported data, need to modify generated message IDs or whatnot…

Yes, it wouldn’t import already imported messages as long as the Message-ID header stays the same. You are out of luck if the Message-ID header is missing in only one of the archives. We use the MD5 hash of the message if the header is missing. You’d need to ensure that both messages either have the same Message-ID header or result in the same MD5 hash.
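
The rule can be pictured with a small sketch. This is not the importer's code: the helper name import_key, the header regular expression and the choice to hash the complete raw message text are assumptions made for illustration only.

```ruby
require "digest"

# Hypothetical helper: returns the Message-ID from the header section if
# there is one, otherwise the MD5 hex digest of the raw message text.
def import_key(raw_message)
  headers = raw_message.split(/\r?\n\r?\n/, 2).first.to_s
  match = headers.match(/^Message-ID:\s*<([^>]+)>/i)
  match ? match[1] : Digest::MD5.hexdigest(raw_message)
end

with_id    = "Message-ID: <abc@example.com>\nSubject: Hi\n\nBody\n"
without_id = "Subject: Hi\n\nBody\n"

puts import_key(with_id)                                # prints "abc@example.com"
puts import_key(without_id) == import_key(without_id)   # prints true
puts import_key(without_id) == import_key(with_id)      # prints false
```

The last line shows the problem case from above: the same message with the header in one archive and without it in the other ends up with two different keys, so it would not be recognised as a duplicate.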

No.

All outgoing emails are disabled during imports.

Yes, you can trick the import script into reusing existing categories.

Ok, that is basically what we did now in the end (we used Category.find_by_name() instead, but I guess that’s just semantics). Good to know we chose the “correct” way :wink: . Thanks!
