Importing mailing lists (mbox, Listserv, Google Groups, emails, ...)

This guide is for you if you want to migrate a mailing list to Discourse.
It also contains instructions for importing messages from Google Groups.

1. Importing using Docker container

This is the recommended way for importing content from your mailing lists into Discourse.

1.1. Installing Discourse

:bulb: The import script most likely won’t work on systems with less than 4GB of RAM; 8GB or more is recommended. You can scale back the RAM after the import if you like.
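A quick way to compare a server against these numbers is sketched below. It assumes a Linux machine with /proc/meminfo; the function names and the slightly lowered thresholds (the kernel reports a bit less than the nominal RAM size) are illustrative choices, not part of Discourse.

```shell
# Print total physical memory in MiB, read from /proc/meminfo (Linux only).
mem_total_mb() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' /proc/meminfo
}

# Classify a memory size (MiB) against the 4GB / 8GB guidance above.
# The cut-offs 3800 and 7600 are assumptions that leave some slack,
# because the reported total is usually below the nominal size.
ram_verdict() {
  local mb=$1
  if [ "$mb" -lt 3800 ]; then
    echo "too-low"
  elif [ "$mb" -lt 7600 ]; then
    echo "minimum"
  else
    echo "recommended"
  fi
}
```

Running `ram_verdict "$(mem_total_mb)"` on the server prints one of the three labels.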

Install Discourse by following the official installation guide. Afterwards it’s a good idea to go to the Admin section and configure a few settings:

  • Enable login_required if imported topics shouldn’t be visible to the public.

  • Enable hide_user_profiles_from_public if user profiles shouldn’t be visible to the public.

  • Disable download_remote_images_to_local if you don’t want Discourse to download images embedded in posts.

  • Enable disable_edit_notifications if you enabled download_remote_images_to_local and don’t want your users to get lots of notifications about posts edited by the system user.

  • Change the value of slug_generation_method if most of the topic titles use characters which shouldn’t be mapped to ASCII (e.g. Arabic). See this post for more information.

:bangbang: The following steps assume that you installed Discourse on Ubuntu and that you are connected to the machine via SSH or have direct access to the machine’s terminal.

1.2. Preparing the Docker container

Copy the container configuration file app.yml to import.yml and edit it with your favorite editor.

cd /var/discourse
cp containers/app.yml containers/import.yml
nano containers/import.yml

Regular import

Add - "templates/import/mbox.template.yml" to the list of templates. Afterwards it should look something like this:

templates:
  - "templates/postgres.template.yml"
  - "templates/redis.template.yml"
  - "templates/web.template.yml"
  - "templates/web.ratelimited.template.yml"
## Uncomment these two lines if you wish to add Lets Encrypt (https)
  #- "templates/web.ssl.template.yml"
  #- "templates/web.letsencrypt.ssl.template.yml"
  - "templates/import/mbox.template.yml"

That’s it. You can save the file, close the editor and build the container.

Google Groups import

You need to add two entries to the list of templates:

  - "templates/import/chrome-dep.template.yml"
  - "templates/import/mbox.template.yml"

Afterwards it should look something like this:

templates:
  - "templates/postgres.template.yml"
  - "templates/redis.template.yml"
  - "templates/web.template.yml"
  - "templates/web.ratelimited.template.yml"
## Uncomment these two lines if you wish to add Lets Encrypt (https)
  #- "templates/web.ssl.template.yml"
  #- "templates/web.letsencrypt.ssl.template.yml"
  - "templates/import/chrome-dep.template.yml"
  - "templates/import/mbox.template.yml"

That’s it. You can save the file, close the editor and build the container.

/var/discourse/launcher stop app
/var/discourse/launcher rebuild import

Building the container creates an import directory within the container’s shared directory. It looks like this:

/var/discourse/shared/standalone/import
├── data
└── settings.yml
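If the import directory does not appear after the rebuild, one thing worth checking is whether the template lines in import.yml are really active and not commented out. The helper below is a minimal sketch for that check; `has_active_template` is a hypothetical name, and it only looks for the quoted template path on a non-comment line.

```shell
# has_active_template FILE TEMPLATE
# Succeeds if FILE mentions "TEMPLATE" (in double quotes) on a line
# that is not commented out with a leading '#'.
has_active_template() {
  local file=$1 template=$2
  awk -v t="\"$template\"" '
    /^[ \t]*#/   { next }        # skip commented-out lines
    index($0, t) { found = 1 }   # plain substring match, no regex
    END { exit !found }
  ' "$file"
}
```

For example, `has_active_template /var/discourse/containers/import.yml templates/import/mbox.template.yml && echo ok` prints `ok` when the mbox template is listed.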

1.3. Downloading messages from Google Groups (optional)

You can skip this step unless you want to migrate from Google Groups.

Instructions for Google Groups

1.3.1. Preparation

:warning: Make sure you don’t have any pinned posts in your group, otherwise the crawler might fail to download some or all messages.

Google account: You need a Google account that has the Manager or Owner role for your Google Group, otherwise the downloaded messages will contain censored email addresses.

Group name: You can find the group name by visiting your Google Group and looking at the browser’s address bar.

Domain name: The URL might look a little different if you are a G Suite customer. You need to know the domain name if the URL contains something like example.com, for instance https://groups.google.com/a/<domain_name>/forum/#!forum/<group_name>.

1.3.2. Cookies :cookie:

In order to download messages, the crawler needs to have access to a Google account that has the owner role for your group. Please visit https://myaccount.google.com/ in your browser and sign in if you aren’t already logged in. Then use a browser extension of your choice to export your cookies for google.com in a file named cookies.txt.

The recommended browser extension is Export Cookies for Mozilla Firefox.

Upload the cookies.txt file to your server and save it within the /var/discourse/shared/standalone/import directory.
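Before starting the download, it can help to confirm that the uploaded file actually contains Google cookies. The sketch below assumes the extension exported the common Netscape cookies.txt format (seven tab-separated fields per line, domain first); `count_google_cookies` is a hypothetical helper, not part of the crawler.

```shell
# count_google_cookies FILE
# Prints how many cookie entries in FILE belong to google.com or a subdomain.
# Assumes the Netscape cookies.txt layout: domain, flag, path, secure,
# expiry, name, value, separated by tabs.
count_google_cookies() {
  awk -F'\t' '
    { sub(/^#HttpOnly_/, "") }          # some exporters prefix HttpOnly cookies this way
    /^#/ || NF != 7 { next }            # skip comments and malformed lines
    $1 ~ /(^|\.)google\.com$/ { n++ }   # field 1 is the cookie domain
    END { print n + 0 }
  ' "$1"
}
```

A result of 0 for `count_google_cookies /var/discourse/shared/standalone/import/cookies.txt` suggests the export went wrong or used a different format.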

1.3.3. Download messages

:bulb: Tip: It’s a good idea to download messages inside a tmux or screen session, so that you can reconnect to the session in case of SSH connection loss.

Let’s start by entering the Docker container.

/var/discourse/launcher enter import

Replace the <group_name> (and if applicable, the <domain_name>) placeholders within the following command with the group name and domain name from step 1.3.1 and execute it inside the Docker container in order to start the download of messages.

If you didn’t find a domain name in step 1.3.1, this is the command for you:

script/import_scripts/google_groups.rb -g <group_name>

Or, if you found a domain name in step 1.3.1, use this command instead:

script/import_scripts/google_groups.rb -g <group_name> -d <domain_name>

Downloading all messages can take a long time. It mostly depends on the number of topics in your Google Group. The script will show you a message like this when it’s finished: Done (00h 26min 52sec)

:bulb: Tip: You can abort the download anytime you want by pressing Ctrl+C
When you restart the download it will continue where it left off.
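A download that finishes within a few seconds usually means nothing was found. As reported further down in this topic, a failed run leaves only a status.yml file in the group's directory below import/data. The following sketch counts everything else in that directory; the helper name is hypothetical and the exact file names the crawler writes are not assumed.

```shell
# count_downloaded_files GROUP_DIR
# Prints the number of files below GROUP_DIR, ignoring status.yml.
count_downloaded_files() {
  local group_dir=$1
  find "$group_dir" -type f ! -name 'status.yml' | wc -l | tr -d ' '
}
```

For example, `count_downloaded_files /var/discourse/shared/standalone/import/data/<group_name>` should print a number greater than 0 after a successful download.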

1.4. Configuring the importer

You can configure the importer by editing the example settings.yml file that has been copied into the import directory.

nano /var/discourse/shared/standalone/import/settings.yml

The settings file comes with sensible defaults, but here are a few tips anyway:

  • The settings file contains multiple examples on how to split data files:

    • mbox files usually are separated by a From header. Choose a regular expression that works for your files.

    • If each of your files contains only one message, set the split_regex to an empty string. This also applies to imports from Google Groups.

    • There’s also an example for files from the popular Listserv mailing list software.

  • prefer_html allows you to configure if the import should use the HTML part of emails when it exists. You should choose what suits you best – it heavily depends on the emails sent to your mailing list.

  • By default each user imported from the mailing list is created as a staged user. You can disable that behaviour by setting staged to false.

  • If your emails do not contain a Message-ID header (like messages stored by Listserv), you should enable the group_messages_by_subject setting.
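To get a feeling for which settings a data file needs, a rough count of separator lines and Message-ID headers can help. The sketch below uses `^From ` as the separator pattern, which is only an illustrative assumption; the regular expression in your settings.yml may differ. The counts are approximate because quoted headers inside message bodies are counted too.

```shell
# mbox_report FILE
# Prints the number of "From " separator lines and Message-ID headers in FILE.
mbox_report() {
  local file=$1 seps ids
  # grep -c exits non-zero when the count is 0, hence "|| true".
  seps=$(grep -c '^From ' "$file" || true)
  ids=$(grep -ci '^Message-ID:' "$file" || true)
  echo "separators=$seps message_ids=$ids"
}
```

If `message_ids` is much lower than `separators`, many messages probably lack a Message-ID header and group_messages_by_subject is worth enabling.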

1.5. Prepare files

Each subdirectory of /var/discourse/shared/standalone/import/data is imported as its own category and should contain the data files you want to import. The names of the data files do not matter.

Example: The import directory should look like this if you want to import two mailing lists with multiple mbox files:

/var/discourse/shared/standalone/import
├── data
│   ├── list 1
│   │   ├── foo
│   │   ├── bar
│   ├── list 2
│   │   ├── 2017-12.mbox
│   │   ├── 2018-01.mbox
└── settings.yml
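Creating that layout by hand is easy enough, but for several lists a small helper can save typing. This is only a sketch; `add_list` is a hypothetical function that creates one subdirectory below data and copies the given files into it.

```shell
# add_list IMPORT_DIR LIST_NAME [FILE...]
# Creates IMPORT_DIR/data/LIST_NAME and copies the given data files into it.
add_list() {
  local import_dir=$1 list_name=$2
  shift 2
  local dest="$import_dir/data/$list_name"
  mkdir -p "$dest"
  # Copy only when at least one file was passed.
  if [ "$#" -gt 0 ]; then
    cp -- "$@" "$dest/"
  fi
}
```

For example, `add_list /var/discourse/shared/standalone/import "list 2" 2017-12.mbox 2018-01.mbox` produces the second directory from the tree above.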

1.6. Executing the import script

:bulb: Tip: It’s a good idea to start the import inside a tmux or screen session, so that you can reconnect to the session in case of SSH connection loss.

Let’s start the import by entering the Docker container and launching the import script inside it.

/var/discourse/launcher enter import
import_mbox.sh # inside the Docker container

Depending on the size of your mailing lists it’s now time for some :coffee: or :sleeping:
The import script will show you a message like this when it’s finished: Done (00h 26min 52sec)

:bulb: Tip: You can abort the import anytime you want by pressing Ctrl+C
When you restart the import it will continue where it left off.

You can exit and stop the Docker container after the import has finished.

exit # inside the Docker container
/var/discourse/launcher stop import

1.7. Starting Discourse

Let’s start the app container and take a look at the imported data.

/var/discourse/launcher start app

Discourse will start and Sidekiq will begin post-processing all the imported posts. This can take a considerable amount of time. You can watch the progress by logging in as admin and visiting http://discourse.example.com/sidekiq

1.8. Clean up

So, you are satisfied with the result of the import and want to free some disk space? The following commands will delete the Docker container used for importing as well as all the files used during the import.

/var/discourse/launcher destroy import
rm /var/discourse/containers/import.yml
rm -R /var/discourse/shared/standalone/import

1.9. The End

Now it’s time to celebrate and enjoy your new Discourse instance! :tada:

2. FAQ

2.1. How can I remove list names (e.g. [Foo]) from topic titles during the import?

You have two options:

  1. rename the directory that contains the mbox files to Foo or
  2. create a metadata.yml file within the directory that contains the mbox files with the following content:
    name: "Foo"
    description: "The description is optional and will be used for the 'About category' topic"
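If you have many lists, the metadata.yml files can also be generated. The helper below is a minimal sketch with a hypothetical name; it writes the two keys shown above and does not escape double quotes or backslashes in the values, so keep names and descriptions simple.

```shell
# write_metadata DIR NAME [DESCRIPTION]
# Writes DIR/metadata.yml with a name and an optional description.
write_metadata() {
  local dir=$1 name=$2 description=${3:-}
  {
    printf 'name: "%s"\n' "$name"
    if [ -n "$description" ]; then
      printf 'description: "%s"\n' "$description"
    fi
  } > "$dir/metadata.yml"
}
```

For example, `write_metadata "/var/discourse/shared/standalone/import/data/list 1" "Foo"` creates a metadata.yml that contains only the name.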
    

2.2. How can I prevent the import script from detecting messages as already being imported?

:warning: The following steps will reset your Discourse forum to the initial state! You will need to start from scratch.

The following commands will stop the container, delete everything except the mbox files and the importer configuration and restart the container.

Commands
cd /var/discourse

./launcher stop app
./launcher stop import

shopt -s extglob # the !(import) pattern needs bash's extglob option
rm -r ./shared/standalone/!(import)
rm ./shared/standalone/import/data/index.db

./launcher rebuild import

./launcher enter import
import_mbox.sh # inside the Docker container

2.3. How can I manipulate messages before they are imported into Discourse?

Enable index_only in settings.yml and take a look at the index.db (a SQLite database) before you run the actual import.

You can use SQL to update missing values in the database if you want. That way you don’t need to reindex any messages. The script uses only data from the index.db during the import phase. Simply disable the index_only option when you are done and rerun the importer. It will skip the indexing if none of the mbox files were changed, recalculate the content of the user and email_order tables and start the actual import process.
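A first look at the database could be taken with the sqlite3 command line tool, as sketched below. This assumes sqlite3 is installed (it may need to be added with apt) and that the file lives at import/data/index.db. Only the user and email_order tables are named in this guide, so the sketch lists whatever tables exist instead of assuming a schema; both function names are hypothetical.

```shell
# list_tables DB
# Prints the names of all tables in a SQLite database, one per line.
list_tables() {
  sqlite3 "$1" "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
}

# count_rows DB TABLE
# Prints the number of rows in TABLE.
count_rows() {
  sqlite3 "$1" "SELECT COUNT(*) FROM \"$2\";"
}
```

For example, `list_tables /var/discourse/shared/standalone/import/data/index.db` shows which tables are available before you write any UPDATE statements.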

2.4. How can I find messages which cause problems during the import?

You can split mbox files into individual files to make it easier to find offending emails.

Commands
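apt install procmail
mkdir -p split # directory for the individual messages
export FILENO=0000 # formail counts up from here and keeps the zero padding
formail -ds sh -c 'cat > split/msg.$FILENO' < mbox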

2.5. I have already imported a group. How can I import another group?

Create a new directory in the import/data directory and restart the import script.

2.6. I don’t have access to Mailman archives in mbox format. Is there any other way to get them?

You could give this script a try.

The crawler for Google Groups has been updated. Instead of trying to login automatically, it’s now using cookies of an active Google session supplied by the user. That should make it a lot more reliable. :crossed_fingers:

The instructions in the OP have been updated as well. Please give it a try.
cc @adammhaile, @Nkep_Kerlyn, @theaeolianmachine

After running the import script, I am able to successfully log in, but no posts are imported. The “<group_name>” directory inside the import directory holds only a “status.yml” file.

Any suggestions?

Looks like you forgot to replace the group name when you started to download messages from Google Groups. I updated the instructions to make it more obvious.

Hm. I definitely replaced it with the name of the Google Group; I was just omitting it in the post.

so the script I run is as follows (radicle being the group name) ->

script/import_scripts/google_groups.rb -g radicle

With this input, an index.db file and a radicle directory are created in the data directory, but the radicle directory contains nothing but the status.yml file.

Here’s the output of the command for reference. This also reflects the issue that no messages are being downloaded as it only takes a couple seconds…

root@discourse-import:/var/www/discourse# script/import_scripts/google_groups.rb -g radicle
Fetching gem metadata from https://rubygems.org/.........
Resolving dependencies...
Using bundler 1.17.3
Using childprocess 3.0.0
Using colored2 3.1.2
Using rubyzip 2.0.0
Using mini_portile2 2.4.0
Using selenium-webdriver 3.142.6
Using nokogiri 1.10.5
Using webdrivers 4.1.3

Logging in...

Done (00h 00min 03sec)

Is this some kind of private or hidden group? I’m seeing the same result as you when I try to import radicle because it doesn’t find anything to import.

Make sure that the user you are using for the import (cookies.txt) is able to see the group’s content. If that’s the case, then I have no idea what’s wrong. I tested with a public group and everything is working as expected.

User that I’m exporting cookies.txt for is definitely able to see the content.

The group’s settings were for “only members” but I’ve since changed it to “anyone on the web”. Hasn’t solved the problem but I’ll keep trying.

If the download is successful, I should see files under the radicle folder before the import, correct?

Another question: when I edit the templates in import.yml and rebuild the import container, are the templates still supposed to be present if I open containers/import.yml in nano again after the rebuild? I’m really confused about why this isn’t working.

The script tries to find topics at https://groups.google.com/forum/?_escaped_fragment_=categories/radicle[1-100], but it fails to do so. I have no idea why your group is different from others.

I’m sorry, I do not understand what you are asking. Can you clarify?

Can you try with our group link? It would be interesting to see if it works for you.

https://groups.google.com/a/monadic.xyz/forum/#!forum/radicle

I guess I was asking if my import is rebuilding correctly after I add the templates in the guide

Oh, that URL works and I see what’s happening. You are a GSuite customer, right? That’s why the URL is different than the ones I’ve seen before. I’m going to fix it.

And yes, you can rebuild the container whenever you want. The data is stored outside of the container.

Yes! Okay, awesome - I’m sitting tight :slight_smile:

Please rebuild the “import” container and run the script with slightly different arguments.

script/import_scripts/google_groups.rb -g radicle -d monadic.xyz

Seems like it’s working!! Will build it out and let you know what happens. Thanks!

Everything worked perfectly. Thanks for the quick response and fix!

@gerhard - I was able to migrate an mbox archive of 22,000 messages using this script on a Digital Ocean droplet with only 1GB RAM. No problems. Thank you for the write-up of instructions. Everything worked great. The only mistake I made on my first attempt was trying to name the /var/discourse/shared/standalone/import/data/X subfolder using a new category I created before running the script. That caused the import to place these messages into the Uncategorized category. On second attempt, I deleted the new category and tried again. This created the category name for me and placed the messages into the proper category automatically.
