This guide is for you if you want to migrate a mailing list to Discourse.
It also contains instructions for importing messages from Google Groups.
1. Importing using Docker container
This is the recommended way for importing content from your mailing lists into Discourse.
1.1. Installing Discourse
The import script most likely won’t work on systems with less than 4GB of RAM; 8GB or more is recommended. You can scale back the RAM usage after the import if you like.
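A quick way to see how a machine compares with those figures is to read MemTotal from /proc/meminfo. The following is a minimal sketch for Linux only. The function names and the 3700 MiB and 7500 MiB cut-offs are assumptions chosen for illustration (the kernel reports slightly less than the installed RAM); they are not values taken from Discourse.

```shell
# total_ram_mib [FILE]: print total RAM in MiB, read from FILE
# (defaults to /proc/meminfo, which only exists on Linux).
total_ram_mib() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# classify_ram MIB: compare a MiB figure with the 4GB / 8GB guidance.
# The cut-offs are illustrative assumptions, set a little below 4096 and 8192.
classify_ram() {
  if [ "$1" -lt 3700 ]; then
    echo "less than 4GB: the import script most likely won't work"
  elif [ "$1" -lt 7500 ]; then
    echo "at least 4GB: may work, 8GB or more is recommended"
  else
    echo "8GB or more: recommended size"
  fi
}

# Only report when the file is available, so the sketch is harmless elsewhere.
if [ -r /proc/meminfo ]; then
  classify_ram "$(total_ram_mib)"
fi
```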
Install Discourse by following the official installation guide. Afterwards it’s a good idea to go to the Admin section and configure a few settings:
- Enable login_required if imported topics shouldn’t be visible to the public.
- Enable hide_user_profiles_from_public if user profiles shouldn’t be visible to the public.
- Disable download_remote_images_to_local if you don’t want Discourse to download images embedded in posts.
- Enable disable_edit_notifications if you enabled download_remote_images_to_local and don’t want your users to get lots of notifications about posts edited by the system user.
- Change the value of slug_generation_method if most of the topic titles use characters which shouldn’t be mapped to ASCII (e.g. Arabic). See this post for more information.
The following steps assume that you installed Discourse on Ubuntu and that you are connected to the machine via SSH or have direct access to the machine’s terminal.
1.2. Preparing the Docker container
Copy the container configuration file app.yml to import.yml and edit it with your favorite editor.
cd /var/discourse
cp containers/app.yml containers/import.yml
nano containers/import.yml
Regular import
Add - "templates/import/mbox.template.yml" to the list of templates. Afterwards it should look something like this:
templates:
- "templates/postgres.template.yml"
- "templates/redis.template.yml"
- "templates/web.template.yml"
- "templates/web.ratelimited.template.yml"
## Uncomment these two lines if you wish to add Lets Encrypt (https)
#- "templates/web.ssl.template.yml"
#- "templates/web.letsencrypt.ssl.template.yml"
- "templates/import/mbox.template.yml"
That’s it. You can save the file, close the editor and build the container.
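Before building, it can be worth confirming that the new entry is really active and not commented out. The check below is a minimal sketch: has_template is a hypothetical helper, and it simply looks for an uncommented list entry with the given name. The same check works for the chrome-dep template used for Google Groups imports.

```shell
# has_template FILE TEMPLATE: succeed if FILE lists TEMPLATE as an
# active (uncommented) entry, e.g.  - "templates/import/mbox.template.yml"
has_template() {
  grep -v '^[[:space:]]*#' "$1" | grep -Fq -- "- \"$2\""
}

# Path taken from the steps above; nothing is printed if the file is missing.
config=/var/discourse/containers/import.yml
if [ -r "$config" ]; then
  if has_template "$config" "templates/import/mbox.template.yml"; then
    echo "mbox template is listed"
  else
    echo "mbox template is missing"
  fi
fi
```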
Google Groups import
You need to add two entries to the list of templates:
- "templates/import/chrome-dep.template.yml"
- "templates/import/mbox.template.yml"
Afterwards it should look something like this:
templates:
- "templates/postgres.template.yml"
- "templates/redis.template.yml"
- "templates/web.template.yml"
- "templates/web.ratelimited.template.yml"
## Uncomment these two lines if you wish to add Lets Encrypt (https)
#- "templates/web.ssl.template.yml"
#- "templates/web.letsencrypt.ssl.template.yml"
- "templates/import/chrome-dep.template.yml"
- "templates/import/mbox.template.yml"
That’s it. You can save the file, close the editor and build the container.
/var/discourse/launcher stop app
/var/discourse/launcher rebuild import
Building the container creates an import directory within the container’s shared directory. It looks like this:
/var/discourse/shared/standalone/import
├── data
└── settings.yml
1.3. Downloading messages from Google Groups (optional)
You can skip this step unless you want to migrate from Google Groups.
Instructions for Google Groups
1.3.1. Preparation
Make sure you don’t have any pinned posts in your group, otherwise the crawler might fail to download some or all messages.
Make sure the group settings allow posting, otherwise you might see “Failed to scrape message” error messages. If you changed those settings recently, it might take a couple of minutes before scraping works.
Google account: You need a Google account that has the Manager or Owner role for your Google Group, otherwise the downloaded messages will contain censored email addresses.
Group name: You can find the group name by visiting your Google Group and looking at the browser’s address bar.
Domain name: The URL might look a little different if you are a G Suite customer. You need to know the domain name if the URL contains something like example.com.
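As an illustration of how the two values relate to the address, here is a small sketch that extracts them from a group URL and prints the matching options for the download script used in step 1.3.3. The URL shapes it recognizes (/g/<group_name>, /forum/#!forum/<group_name>, and an optional /a/<domain_name>/ prefix) are assumptions based on commonly seen Google Groups addresses, and google_groups_args is a hypothetical helper name.

```shell
# google_groups_args URL: print "-g <group_name>" and, when the URL has an
# /a/<domain_name>/ part, "-d <domain_name>" as well.
# The recognized URL shapes are assumptions; check them against your own URL.
google_groups_args() {
  local domain group
  domain=$(printf '%s\n' "$1" |
    sed -nE 's,^https?://groups\.google\.com/a/([^/]+)/.*,\1,p')
  group=$(printf '%s\n' "$1" |
    sed -nE 's,.*/(g|forum/#!forum)/([^/?#]+).*,\2,p')
  [ -n "$group" ] || return 1
  if [ -n "$domain" ]; then
    printf '%s\n' "-g $group -d $domain"
  else
    printf '%s\n' "-g $group"
  fi
}

google_groups_args "https://groups.google.com/a/example.com/g/my-list"
# prints: -g my-list -d example.com
```

If the function prints nothing and returns a non-zero status, the URL did not match any of the assumed shapes and the values have to be read from the address bar by hand.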
1.3.2. Cookies
In order to download messages, the crawler needs to have access to a Google account that has the owner role for your group. Please visit https://myaccount.google.com/ in your browser and sign in if you aren’t already logged in. Then use a browser extension of your choice to export your cookies for google.com to a file named cookies.txt.
The recommended browser extension is Export Cookies for Mozilla Firefox.
Upload the cookies.txt file to your server and save it within the /var/discourse/shared/standalone/import directory.
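Before starting the download, it can help to confirm that the uploaded file actually contains cookies for google.com. This sketch assumes the extension writes the tab-separated Netscape cookies.txt format (seven fields per line, domain first); count_google_cookies is a hypothetical helper, not part of Discourse.

```shell
# count_google_cookies FILE: count cookie lines whose domain is google.com
# or one of its subdomains. Assumes the Netscape cookies.txt layout.
count_google_cookies() {
  awk -F '\t' '
    { domain = $1; sub(/^#HttpOnly_/, "", domain) }   # HttpOnly cookies carry this prefix
    domain ~ /^#/ || NF < 7 { next }                  # skip comments and malformed lines
    domain ~ /(^|\.)google\.com$/ { n++ }
    END { print n + 0 }
  ' "$1"
}

# Path taken from the step above; nothing is printed if the file is missing.
cookies=/var/discourse/shared/standalone/import/cookies.txt
if [ -r "$cookies" ]; then
  echo "google.com cookies found: $(count_google_cookies "$cookies")"
fi
```

A count of 0 suggests the export was made for the wrong site or in a different file format.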
1.3.3. Download messages
Tip: It’s a good idea to download messages inside a tmux or screen session, so that you can reconnect to the session in case of SSH connection loss.
Let’s start by entering the Docker container.
/var/discourse/launcher enter import
Replace the <group_name> (and, if applicable, the <domain_name>) placeholders in the following command with the group name and domain name from step 1.3.1, then execute it inside the Docker container to start downloading the messages.
If you didn’t find a domain name in step 1.3.1, this is the command for you:
script/import_scripts/google_groups.rb -g <group_name>
Or, if you found a domain name in step 1.3.1, use this command instead:
script/import_scripts/google_groups.rb -g <group_name> -d <domain_name>
Downloading all messages can take a long time. It mostly depends on the number of topics in your Google Group. The script will show you a message like this when it’s finished: Done (00h 26min 52sec)
Tip: You can abort the download at any time by pressing Ctrl+C. When you restart the download, it will continue where it left off.
1.4. Configuring the importer
You can configure the importer by editing the example settings.yml file that has been copied into the import directory.
nano /var/discourse/shared/standalone/import/settings.yml
The settings file comes with sensible defaults, but here are a few tips anyway:
- The settings file contains multiple examples of how to split data files:
  - mbox files are usually separated by a From header. Choose a regular expression that works for your files.
  - If each of your files contains only one message, set split_regex to an empty string. This also applies to imports from Google Groups.
  - There’s also an example for files from the popular Listserv mailing list software.
- prefer_html lets you configure whether the import should use the HTML part of emails when it exists. Choose what suits you best; it depends heavily on the emails sent to your mailing list.
- By default, each user imported from the mailing list is created as a staged user. You can disable that behaviour by setting staged to false.
- If your emails do not contain a Message-ID header (like messages stored by Listserv), you should enable the group_messages_by_subject setting.
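One way to sanity-check a candidate split_regex before importing is to count how many lines of a data file it matches; the count should equal the number of messages you expect in that file. The sketch below uses grep -E, which is an assumption: the importer evaluates the expression itself, and grep’s extended syntax can only be expected to agree with it for simple patterns. The patterns shown are illustrations, so compare them with the examples in your settings.yml. count_split_points is a hypothetical helper name.

```shell
# count_split_points REGEX FILE: count the lines of FILE matched by REGEX.
count_split_points() {
  grep -cE -- "$1" "$2"
}

# Build a tiny two-message mbox file to try the patterns on.
sample=$(mktemp)
cat > "$sample" <<'EOF'
From alice@example.com Mon Jan  1 10:00:00 2018
Subject: First message

From now on the list lives on Discourse.
From bob@example.com Tue Jan  2 11:00:00 2018
Subject: Second message

Thanks!
EOF

count_split_points '^From .+@.+' "$sample"   # prints 2
count_split_points '^From '      "$sample"   # prints 3 (too broad: also matches body text)
rm -f "$sample"
```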
1.5. Prepare files
Each subdirectory of /var/discourse/shared/standalone/import/data gets imported as its own category, and each directory should contain the data files you want to import. The names of those files do not matter.
Example: The import directory should look like this if you want to import two mailing lists with multiple mbox files:
/var/discourse/shared/standalone/import
├── data
│   ├── list 1
│   │   ├── foo
│   │   └── bar
│   └── list 2
│       ├── 2017-12.mbox
│       └── 2018-01.mbox
└── settings.yml
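To check the layout before running the import, you can list each subdirectory of data together with the number of files it holds; each line of output corresponds to one category that would be created. This is a minimal sketch, and summarize_import_data is a hypothetical helper name.

```shell
# summarize_import_data DATA_DIR: print "<subdirectory>: <n> file(s)" for
# every subdirectory of DATA_DIR. Files directly inside DATA_DIR are ignored.
summarize_import_data() {
  local dir
  for dir in "$1"/*/; do
    [ -d "$dir" ] || continue   # the glob stays literal when nothing matches
    printf '%s: %s file(s)\n' "$(basename "$dir")" \
      "$(find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' ')"
  done
}

# Path taken from the step above; nothing is printed if it does not exist.
data_dir=/var/discourse/shared/standalone/import/data
if [ -d "$data_dir" ]; then
  summarize_import_data "$data_dir"
fi
```

For the example layout, this would report two files each for "list 1" and "list 2".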
1.6. Executing the import script
Tip: It’s a good idea to start the import inside a tmux or screen session, so that you can reconnect to the session in case of SSH connection loss.
Let’s start the import by entering the Docker container and launching the import script inside it.
/var/discourse/launcher enter import
import_mbox.sh # inside the Docker container
Depending on the size of your mailing lists, the import can take a while.
The import script will show you a message like this when it’s finished: Done (00h 26min 52sec)
Tip: You can abort the import at any time by pressing Ctrl+C. When you restart the import, it will continue where it left off.
You can exit and stop the Docker container after the import has finished.
exit # inside the Docker container
/var/discourse/launcher stop import
1.7. Starting Discourse
Let’s start the app container and take a look at the imported data.
/var/discourse/launcher start app
Discourse will start and Sidekiq will begin post-processing all the imported posts. This can take a considerable amount of time. You can watch the progress by logging in as admin and visiting http://discourse.example.com/sidekiq
1.8. Clean up
So, you are satisfied with the result of the import and want to free some disk space? The following commands will delete the Docker container used for importing as well as all the files used during the import.
/var/discourse/launcher destroy import
rm /var/discourse/containers/import.yml
rm -R /var/discourse/shared/standalone/import
1.9. The End
Now it’s time to celebrate and enjoy your new Discourse instance!
2. FAQ
2.1. How can I remove list names (e.g. [Foo]) from topic titles during the import?
You can use an empty tag to remove one or more prefixes from topic titles. The settings file contains an example.
2.2. How can I prevent the import script from detecting messages as already being imported?
The following steps will reset your Discourse forum to the initial state! You will need to start from scratch.
The following commands will stop the container, delete everything except the mbox files and the importer configuration and restart the container.
Commands
cd /var/discourse
./launcher stop app
./launcher stop import
shopt -s extglob # bash: enable the extended glob used on the next line
rm -r ./shared/standalone/!(import)
rm ./shared/standalone/import/data/index.db
./launcher rebuild import
./launcher enter import
import_mbox.sh # inside the Docker container
2.3. How can I manipulate messages before they are imported into Discourse?
Enable index_only in settings.yml and take a look at the index.db (a SQLite database) before you run the actual import.
You can use SQL to update missing values in the database if you want. That way you don’t need to reindex any messages. The script uses only data from the index.db during the import phase. Simply disable the index_only option when you are done and rerun the importer. It will skip the indexing if none of the mbox files were changed, recalculate the content of the user and email_order tables and start the actual import process.
2.4. How can I find messages which cause problems during the import?
You can split mbox files into individual files to make it easier to find offending emails.
Commands
apt install procmail
mkdir -p split # the output directory must exist
export FILENO=0000 # starting number; its width sets the zero padding
formail -ds sh -c 'cat > split/msg.$FILENO' < mbox
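If installing procmail is not an option, a rough equivalent can be sketched with awk. This is a swapped-in technique, not what formail does internally: it starts a new file at every line beginning with "From ", so an unescaped "From " at the start of a body line would cause a wrong split. split_mbox and the msg.NNNN naming are assumptions chosen to mirror the commands above.

```shell
# split_mbox MBOX OUTDIR: write each message of MBOX to OUTDIR/msg.0000,
# OUTDIR/msg.0001, ... A new message starts at every line beginning with "From ".
split_mbox() {
  mkdir -p "$2"
  awk -v dir="$2" '
    /^From / {
      if (out != "") close(out)        # keep the number of open files low
      out = sprintf("%s/msg.%04d", dir, n++)
    }
    out != "" { print > out }          # skip anything before the first message
  ' "$1"
}

# Usage, with the same file and directory names as above:
#   split_mbox mbox split
```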
2.5. I have already imported a group. How can I import another group?
Create a new directory in the import/data directory and restart the import script.
2.6. I don’t have access to Mailman archives in mbox format. Is there any other way to get them?
You could give this script a try.
Last edited by @JammyDodger 2024-05-27T14:56:11Z