Importing / migrating mailing lists (mbox, Listserv, Google Groups, emails, ...)

I’m trying to download Google Groups and am getting

Failed to login. Please check the content of your cookies.txt

I used the recommended Firefox extension to download the cookies. Once yesterday and again today. I’ve confirmed that it’s reading the file by renaming it to something wrong and getting a “not found” error. I downloaded all the cookies, not just google ones. I logged out and back in and downloaded the cookies again.

I can see that I’m a manager because I have the “manage group” options.

I’ve triple-rechecked that I’m using the right group name by copy-pasting and seeing that it’s a group name format and not a domain name one.

Is something broken or is it just me?

@gerhard, sorry for the call-out, but have you a quick suggestion on how to debug this? Maybe a login endpoint has changed?

EDIT: Found it. I’ll submit a PR shortly. The endpoint for login changed and I managed to guess the new one. :slight_smile:

https://github.com/discourse/discourse/pull/9432


Newbie trying to import mbox files from Yahoo groups. I’ve followed these instructions several times but always with the same error message. I see others have been successful so it is likely a newbie mistake. The error appears to indicate that split_regex: "^From .+@.+" is not finding the email key to split the file but I tested the regex in a text editor and it works as expected. Line 2 of the import file is similar to Message-ID: <35690.0.1.959300741@eGroups.com>
Any ideas? TIA…

The mbox import is starting...

Traceback (most recent call last):
	12: from script/import_scripts/mbox.rb:9:in `<main>'
	11: from script/import_scripts/mbox.rb:10:in `<module:ImportScripts>'
	10: from script/import_scripts/mbox.rb:12:in `<module:Mbox>'
	 9: from script/import_scripts/mbox.rb:12:in `new'
	 8: from /var/www/discourse/script/import_scripts/mbox/importer.rb:11:in `initialize'
	 7: from /var/www/discourse/script/import_scripts/mbox/support/settings.rb:8:in `load'
	 6: from /usr/local/lib/ruby/2.6.0/psych.rb:577:in `load_file'
	 5: from /usr/local/lib/ruby/2.6.0/psych.rb:577:in `open'
	 4: from /usr/local/lib/ruby/2.6.0/psych.rb:578:in `block in load_file'
	 3: from /usr/local/lib/ruby/2.6.0/psych.rb:277:in `load'
	 2: from /usr/local/lib/ruby/2.6.0/psych.rb:390:in `parse'
	 1: from /usr/local/lib/ruby/2.6.0/psych.rb:456:in `parse_stream'
/usr/local/lib/ruby/2.6.0/psych.rb:456:in `parse': (/shared/import/settings.yml): did not find expected key while parsing a block mapping at line 2 column 1 (Psych::SyntaxError)

Looks like you made an error in the settings.yml file. I suggest you validate the configuration at http://www.yamllint.com/
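For a quick local check without pasting the file into a website, a minimal sketch along these lines may help. It relies only on Ruby's bundled Psych parser, which is the parser that raised the error in the traceback above. The helper name `yaml_syntax_error` and the two sample strings are made up for illustration.

```ruby
require "yaml"

# Returns nil when the text parses as YAML, otherwise the parser's error message.
# `yaml_syntax_error` is a hypothetical helper, not part of the import script.
def yaml_syntax_error(text)
  Psych.parse(text)
  nil
rescue Psych::SyntaxError => e
  e.message
end

good = "split_regex: \"^From .+@.+\"\n"
bad = "split_regex: \"^From .+@.+\n" # the closing quote is missing

puts yaml_syntax_error(good).inspect
puts yaml_syntax_error(bad)
```

To check the real file, pass it the contents of the settings file instead, for example `yaml_syntax_error(File.read("/shared/import/settings.yml"))`.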


Thanks @gerhard. Sigh… I should have seen that issue; this is my first bout with Ruby. Now I think I’m a little closer, but I get a different error (see below). Since the import script is now loading groups, etc., I assume the new error is past the initial problem. I also assume the referenced db file is import/index.db, which the import script is supposed to create (it has not been created).

The mbox import is starting...

Loading existing groups...
Loading existing users...
Loading existing categories...
Loading existing posts...
Loading existing topics...
Traceback (most recent call last):
	9: from script/import_scripts/mbox.rb:9:in `<main>'
	8: from script/import_scripts/mbox.rb:10:in `<module:ImportScripts>'
	7: from script/import_scripts/mbox.rb:12:in `<module:Mbox>'
	6: from script/import_scripts/mbox.rb:12:in `new'
	5: from /var/www/discourse/script/import_scripts/mbox/importer.rb:14:in `initialize'
	4: from /var/www/discourse/script/import_scripts/mbox/importer.rb:14:in `new'
	3: from /var/www/discourse/script/import_scripts/mbox/support/database.rb:10:in `initialize'
	2: from /var/www/discourse/script/import_scripts/mbox/support/database.rb:10:in `new'
	1: from /var/www/discourse/vendor/bundle/ruby/2.6.0/gems/sqlite3-1.4.2/lib/sqlite3/database.rb:89:in `initialize'
/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/sqlite3-1.4.2/lib/sqlite3/database.rb:89:in `open_v2': unable to open database file (SQLite3::CantOpenException)

SYSTEM won’t allow me to edit my comment so I am submitting this reply instead.

EDIT: To close the loop… My Yahoo Group import is now working, at least to the point of indexing 9951 emails. I have not yet finished the full import, so more to come. I edited settings.yml many times and am now back to the original, which suddenly seems to work without the syntax error. I don’t understand why I got numerous error messages that appear inconsistent to me. The original syntax error in settings.yml remains a mystery, and the error message above makes no sense to me… sigh.


@gerhard. I think I have found a way easier method of doing exactly the same as your guide, but with no technical knowledge required nor need for admin access to any server. Let me know what you think.

Overview

We’ll essentially be configuring a mailing list and then using an email archive to send past conversations in order. Those emails will be forwarded, but not with the “Forward” button of an email client (that would override the headers and mess up the indentation). What we want to do is remail them (send them as if they had been sent to Discourse in the first place).

Requirements and Assumptions

  • Access to the previous email exchanges: someone who has stored it all on their email client and can volunteer to forward it – let’s call that person John Doe.

  • Time: the email forwarding will be very slow so that Discourse can keep up (perhaps a few days with a computer running and uploading the emails, depending on the archive size)

  • Thunderbird client: We also assume here John Doe uses the email client “thunderbird”. It may be possible to do this with other clients but I haven’t looked.

The following guide uses two email addresses as placeholders. You need to replace them with your actual addresses.

:incoming_envelope: johndoe@example.com John Doe’s email (the person who will forward the full mailing list archive)

:postbox: discourse+mailinglist-3@discoursemail.com discourse email for forwarding emails to the category of the mailinglist (see setup 1. for how you get it)

Instructions

Here’s a basic rundown of the instructions:

  1. follow the guide on Creating a read-only mailing list mirror to create a mirror of your mailing list

    Note: this will only mirror your mailinglist going forward. You’ll still miss out on past conversations. That’s what the rest of this guide is for.

  2. Change the way discourse forwards emails to (I’m not actually sure this is needed)
    forwarded_behavior

  3. Edit the category’s settings and under the setting Custom incoming email address: add at the end of what’s there |johndoe@example.com.

    The pipe here works like a comma: it says that you also want johndoe@example.com to be able to send to that category

  4. John Doe installs on thunderbird the extension Mail Redirect.

    This is because it’s no regular email forward. What this will do is send the email as if it had gone to the discourse’s email address in the first place instead of John Doe’s

  5. John Doe goes to the extension’s settings and sets the following to 1 (default is 5)
    mail_redirect

    This will make sure the replies arrive in order: otherwise discourse isn’t quick enough to realise that the replies are chained and just creates a new topic for every reply – but it will make the forwarding process very slow

  6. John Doe selects all of the mailinglist’s past emails, right-clicks and clicks on Redirect. A new window will then open, where he adds discourse+mailinglist-3@discoursemail.com as the Resend-to

After this John Doe’s email client will be slowly sending the email archives to discourse. Just check after some time to see if the discourse category is getting filled with some nostalgically old conversations.
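To illustrate the difference between forwarding and redirecting, here is a rough sketch. It is not the Mail Redirect extension's actual code. It assumes the redirect follows the RFC 5322 convention of prepending Resent-* header fields while leaving the original headers and body untouched; the function name `redirect` and the sample message are hypothetical.

```ruby
require "time"

# Prepends Resent-* header fields to a raw message without changing the
# original From, Date, Message-ID or body (assumed RFC 5322 style resending).
def redirect(raw_message, resent_from:, resent_to:, now: Time.now)
  resent_headers = [
    "Resent-From: #{resent_from}",
    "Resent-To: #{resent_to}",
    "Resent-Date: #{now.rfc2822}",
  ]
  resent_headers.join("\r\n") + "\r\n" + raw_message
end

original = "From: alice@example.org\r\n" \
           "Message-ID: <1@example.org>\r\n" \
           "Subject: Hello list\r\n" \
           "\r\n" \
           "First post\r\n"

puts redirect(
  original,
  resent_from: "johndoe@example.com",
  resent_to: "discourse+mailinglist-3@discoursemail.com",
)
```

Because the original From and Message-ID fields survive, the receiving side can still see who wrote each email and how replies are chained, which a regular forward would lose.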

Cleanup

  • Remove John Doe’s email from that category’s Custom incoming email address: setting (and don’t forget to remove the |)

  • Uninstall the Mail Redirect extension (you’ll likely not need it again), or at the very least set the SMTP connections back to 5.


We are trying to migrate our Mailman lists into an already running discourse instance. There are several private lists included for which we need permissions set for the corresponding category. When creating those categories before the import, all the posts for the private lists are added to “Uncategorized” (so automatically public).

So we have two alternative questions:

  • Is there a way to set permissions for the imported mailing lists (if they would be only admin-visible, it would already be sufficient for us) before import?
  • Is there a way to add the mailing list to an existing category (with preset permissions)?

My discourse is the continuation of a Yahoo group, which itself was a continuation of an AOL listserv. Last fall, in the face of the great Yahoo purge, I was able to download a .mbox archive of the Yahoo group, and import those messages following these instructions. I’ve now gotten a partial archive of the AOL listserv, and I’d like to import those messages as well.

Easy enough, right? Just make import/data/foo, put the messages there, and run the import script. But what I’m wondering about is if I later manage to get a complete (or a more-complete) archive. Can I just put those files into import/data/foo, run the import script again, and have it add the new messages to the same category?

  • Would it de-dupe? Or would I see multiple copies of messages that appeared in both archives?
    • Would it change the answer to this question if one, the other, or both of the archives lacked message-id headers?
  • Would a new import in the same category overwrite existing messages?
  • Most of my users are in mailing-list mode. If I don’t want to spam them with hundreds (or thousands) of notifications, not to mention run up an expensive Mailgun bill, I assume I’ll want to disable email site-wide while the import is going on?

Unfortunately that’s not possible.

Yes, you can trick the import script into reusing existing categories.

./launcher enter app
rails c

# Use the category ID shown in the URL, for example
# it's 56 when the category's path looks like this: /c/howto/devs/56
category = Category.find(56)

# Use the directory name where the mbox files are stored. For example,
# when the files are stored in import/data/foo, you should use "foo" as directory name.
category.custom_fields["import_id"] = "directory_name"
category.save!

That’s unexpected. I’ve never seen that happen, but I’ve never tried to import into existing categories with permissions other than the default permissions.

If you can’t get it to work I’d suggest you post an announcement on your forum, switch your site into read-only mode, create a backup, restore the backup on a different server, run the import, configure the category permissions, create another backup and restore it on your production site.


Yes, you can. You might want to keep the import/data/index.db file around, just in case you want to have a look at the previously imported data, need to modify generated message IDs or whatnot…

Yes, it wouldn’t import already imported messages as long as the Message-ID header stays the same. You are out of luck if the Message-ID header is missing in only one of the archives. We use the MD5 hash of the message if the header is missing. You’d need to ensure that both messages either have the same Message-ID header or result in the same MD5 hash.

No.

All outgoing emails are disabled during imports.
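The de-duplication rule described above can be sketched roughly as follows. This is not the importer's actual code: the helper name `dedupe_key`, the header regex, and the choice to hash the entire raw message are assumptions made for illustration.

```ruby
require "digest/md5"

# Returns the key used to recognise an already imported message:
# the Message-ID header when present, otherwise an MD5 hash of the message.
# Folded (multi-line) headers are not handled in this sketch.
def dedupe_key(raw_message)
  headers = raw_message.split(/\r?\n\r?\n/, 2).first.to_s
  match = headers.match(/^Message-ID:\s*<?([^>\s]+)>?/i)
  match ? match[1] : Digest::MD5.hexdigest(raw_message)
end

with_id = "Message-ID: <35690.0.1.959300741@eGroups.com>\nSubject: hi\n\nbody\n"
without_id = "Subject: hi\n\nbody\n"

puts dedupe_key(with_id)
puts dedupe_key(without_id)
```

The two sample messages carry the same subject and body but produce different keys, which is why an archive that lacks the header would not be matched against one that has it.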


Yes, you can trick the import script into reusing existing categories.

Ok, that is basically what we did now in the end (we used Category.find_by_name() instead, but I guess that’s just semantics). Good to know we chose the “correct” way :wink: . Thanks!


I’m trying to import a standard mbox dump from a mailing list but running into “Process killed” issues, typically after a long time spent on the “indexing […]mbox” step. This is from a large mbox file from an open-source project with ten years of posts.

Things I’ve tried:

  • Splitting the mbox file into chunks. This did work partially, and successfully imported many posts, but I’m now stuck on the indexing of one of these chunks. I tried to divide that file into chunks too, the first of which imported eventually, and the second of which now seems stalled.

  • Increasing the memory available on our server. Memory usage slowly increases during indexing and currently plateaus at around 16 GB (out of 32 GB) for an attempted import of one of these chunks, an 80 MB mbox file.

During this time, 1 CPU continues to be maxed out.

Any advice would be much appreciated, in particular increasing verbosity of debug output if maybe it’s getting stuck on a particular post. The index.db file in the import folder is about 800 MB.
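For anyone wanting to reproduce the chunking approach mentioned above, a minimal sketch could look like this. It assumes messages start at lines matching the `^From .+@.+` split regex quoted earlier in this topic; the function name `split_mbox` and the `chunk_N.mbox` naming are hypothetical, and lines before the first separator are dropped.

```ruby
# Splits an mbox file into chunk files holding at most `per_chunk` messages each.
# Files are read and written as binary so odd encodings do not raise errors.
def split_mbox(path, out_dir, per_chunk, split_regex = /^From .+@.+/)
  chunk_paths = []
  out = nil
  count = 0
  File.foreach(path, encoding: "BINARY") do |line|
    if line.match?(split_regex)
      # Start a new chunk file every `per_chunk` messages.
      if (count % per_chunk).zero?
        out&.close
        chunk_path = File.join(out_dir, format("chunk_%d.mbox", chunk_paths.size + 1))
        out = File.open(chunk_path, "wb")
        chunk_paths << chunk_path
      end
      count += 1
    end
    out&.write(line)
  end
  chunk_paths
ensure
  out&.close
end
```

Calling `split_mbox("list.mbox", "chunks", 1000)` would write `chunks/chunk_1.mbox`, `chunks/chunk_2.mbox` and so on, provided the `chunks` directory already exists. Smaller chunks make it easier to narrow down which message causes a hang.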

I’m new to Ruby and do not regularly use SQL, so I’m finding it difficult to figure out what’s going on. Also this 32 GB server is expensive so I’d like to resize it back to 4 GB soon :slight_smile:

Thanks for any help!


I guess the parser hangs on one particular email in that mbox. The index.db is an SQLite database. Take a look at the email table, filter by the mbox filename in the filename column and find the highest value within the last_line_number column. It’s highly likely that the parser hangs at the next email after that line number within the mbox file.
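Once the highest last_line_number is known, the suspect email can be pulled out of the mbox file with a small helper like the one below. This is a sketch under assumptions: the line numbers are taken to be 1-based, messages are assumed to start at lines matching the `^From .+@.+` split regex, and the name `message_after` is made up. The SQL in the comment only uses the table and column names mentioned above.

```ruby
# Find the line number first, for example with the sqlite3 command-line shell:
#   SELECT MAX(last_line_number) FROM email WHERE filename = 'your_file.mbox';
#
# Returns the lines of the first message that starts after `last_line_number`.
def message_after(path, last_line_number, split_regex = /^From .+@.+/)
  message = []
  File.foreach(path, encoding: "BINARY").with_index(1) do |line, number|
    next if number <= last_line_number
    if line.match?(split_regex)
      break unless message.empty? # reached the message after the suspect
      message << line
    elsif !message.empty?
      message << line
    end
  end
  message
end
```

Writing the result to its own file, for example `File.binwrite("suspect.mbox", message_after(path, n).join)`, gives a small test case that can be indexed on its own or shared.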


Many thanks @gerhard, I’ve managed to identify the last successfully indexed email and the email immediately afterwards that (I assume) is causing the hang. However, there doesn’t seem to be anything exceptional about these emails to me. Is it ok if I send you these two example emails in a private message to see if anything stands out?

Sure, you can send me a PM. And try removing those emails from the mbox file and test if the indexing works.


Thanks, sent. I wasn’t able to PM you directly so I sent it via the team group; I hope this is ok. I will try removing the email too and see how far along we get before another hang.


Thanks for the emails. I didn’t see anything out of the ordinary and in my tests it worked without problems.

This is untested, but you could try to apply the following git patch before running the import script. It adds a 60 second timeout for parsing an email. That might help you in finding the culprit and moving on if it affects only a couple of messages.

From 92efb4fc68724cfa20d5de48ba33b99c126a3a08 Mon Sep 17 00:00:00 2001
From: Gerhard Schlager
Date: Fri, 2 Oct 2020 17:27:39 +0200
Subject: [PATCH] Add timeout for parsing email in mbox importer

---
 script/import_scripts/mbox/support/indexer.rb | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/script/import_scripts/mbox/support/indexer.rb b/script/import_scripts/mbox/support/indexer.rb
index dc6e092c29..01523dad13 100644
--- a/script/import_scripts/mbox/support/indexer.rb
+++ b/script/import_scripts/mbox/support/indexer.rb
@@ -65,11 +65,15 @@ module ImportScripts::Mbox
     def index_emails(directory, category_name)
       all_messages(directory, category_name) do |receiver, filename, opts|
         begin
-          msg_id = receiver.message_id
-          parsed_email = receiver.mail
-          from_email, from_display_name = receiver.parse_from_field(parsed_email)
-          body, elided, format = receiver.select_body
-          reply_message_ids = extract_reply_message_ids(parsed_email)
+          msg_id = parsed_email = from_email = from_display_name = body = elided = format = reply_message_ids = nil
+
+          Timeout.timeout(60) do
+            msg_id = receiver.message_id
+            parsed_email = receiver.mail
+            from_email, from_display_name = receiver.parse_from_field(parsed_email)
+            body, elided, format = receiver.select_body
+            reply_message_ids = extract_reply_message_ids(parsed_email)
+          end
 
           email = {
             msg_id: msg_id,
-- 
2.28.0
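The idea behind the patch above, wrapping the slow step in `Timeout.timeout` and moving on when it expires, can be tried in isolation with a standalone sketch like this one. The method name `parse_with_timeout` and the fake "slow" message are hypothetical; only `Timeout.timeout` and `Timeout::Error` come from Ruby's standard library.

```ruby
require "timeout"

# Runs the given block for every message and skips any message whose
# processing takes longer than `seconds`.
def parse_with_timeout(messages, seconds)
  parsed = []
  skipped = []
  messages.each do |message|
    begin
      Timeout.timeout(seconds) { parsed << yield(message) }
    rescue Timeout::Error
      skipped << message # remember the culprit and carry on
    end
  end
  [parsed, skipped]
end

parsed, skipped = parse_with_timeout(["a", "slow", "b"], 0.2) do |message|
  sleep(1) if message == "slow" # simulates a parser that hangs
  message.upcase
end

puts parsed.inspect
puts skipped.inspect
```

The skipped list plays the same role as the "Failed to index message" output: it tells you which inputs to inspect afterwards.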

Many thanks @gerhard, your patch is working like a dream. For my purposes, I think skipping the bad messages is okay since there are only a small number of them. However, we do now have additional output, if it’s helpful for solving the issue or for making the importer script more robust:

Failed to index message in /shared/import/data/lammps-users/chunk_10.mbox at lines 726814-729353
execution expired
["/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogumbo-2.0.2/lib/nokogumbo/html5.rb:243:in `escape_text'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogumbo-2.0.2/lib/nokogumbo/html5.rb:214:in `serialize_node_internal'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogumbo-2.0.2/lib/nokogumbo/html5/node.rb:58:in `write_to'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogiri-1.10.10/lib/nokogiri/xml/node.rb:699:in `serialize'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogiri-1.10.10/lib/nokogiri/xml/node.rb:855:in `to_format'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogiri-1.10.10/lib/nokogiri/xml/node.rb:711:in `to_html'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogumbo-2.0.2/lib/nokogumbo/html5/node.rb:28:in `block in inner_html'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogiri-1.10.10/lib/nokogiri/xml/node_set.rb:238:in `block in each'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogiri-1.10.10/lib/nokogiri/xml/node_set.rb:237:in `upto'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogiri-1.10.10/lib/nokogiri/xml/node_set.rb:237:in `each'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogumbo-2.0.2/lib/nokogumbo/html5/node.rb:28:in `map'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogumbo-2.0.2/lib/nokogumbo/html5/node.rb:28:in `inner_html'",
"/var/www/discourse/lib/html_to_markdown.rb:74:in `block (2 levels) in hoist_line_breaks!'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogiri-1.10.10/lib/nokogiri/xml/node_set.rb:238:in `block in each'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogiri-1.10.10/lib/nokogiri/xml/node_set.rb:237:in `upto'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogiri-1.10.10/lib/nokogiri/xml/node_set.rb:237:in `each'",
"/var/www/discourse/lib/html_to_markdown.rb:57:in `block in hoist_line_breaks!'",
"/var/www/discourse/lib/html_to_markdown.rb:54:in `loop'",
"/var/www/discourse/lib/html_to_markdown.rb:54:in `hoist_line_breaks!'",
"/var/www/discourse/lib/html_to_markdown.rb:16:in `initialize'",
"/var/www/discourse/lib/email/receiver.rb:387:in `new'",
"/var/www/discourse/lib/email/receiver.rb:387:in `select_body'",
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:74:in `block (2 levels) in index_emails'", 
"/usr/local/lib/ruby/2.6.0/timeout.rb:108:in `timeout'",
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:70:in `block in index_emails'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:139:in `block (2 levels) in all_messages'",
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:171:in `block in each_mail'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:190:in `block in each_line'",
 "/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:189:in `each_line'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:189:in `each_line'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:166:in `each_mail'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:132:in `block in all_messages'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:125:in `foreach'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:125:in `all_messages'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:66:in `index_emails'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:25:in `block in execute'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:22:in `each'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:22:in `execute'", 
"/var/www/discourse/script/import_scripts/mbox/importer.rb:43:in `index_messages'", 
"/var/www/discourse/script/import_scripts/mbox/importer.rb:27:in `execute'", 
"/var/www/discourse/script/import_scripts/base.rb:47:in `perform'", 
"script/import_scripts/mbox.rb:12:in `<module:Mbox>'", 
"script/import_scripts/mbox.rb:10:in `<module:ImportScripts>'", 
"script/import_scripts/mbox.rb:9:in `<main>'"]

As previously, I can share the specific message if it’s helpful – this time the error message gives me the specific line numbers so we can at least have high confidence that we’ve identified the correct message.


Sure, please share the messages with me and I will take a look. If there’s a problem we can fix, it will improve not only the importer but Discourse itself, as it uses the same parser for incoming emails.


I’ve been running this script daily for the past several months for a site that really needs to switch to subscribing the category to the group, but hasn’t done that yet. It works fine except every so often I need to get a new cookies.txt file. About a month ago, something happened and it started complaining that “It looks like you do not have permissions to see email addresses. Aborting.” I did . . . something . . . and it started working again. Just over a week ago it happened again and I’ve re-downloaded the cookies with multiple browsers/cookie plugins and keep getting the no-email version of the posts. I can see the addresses when logged in with the web browser.

Has anyone else had trouble lately? Any ideas on what to do? I’ve tried playing around with what domains are in the add_cookies call in the script, but that hasn’t helped.
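One thing that can be ruled out quickly is a stale export. The sketch below lists cookies in a cookies.txt file whose expiry time has already passed. It assumes the common Netscape cookie file layout (seven tab-separated fields with the expiry as a Unix timestamp in the fifth) and the `#HttpOnly_` line prefix that some exporters add; the helper name `expired_cookies` is made up, and this is not part of the Google Groups script.

```ruby
# Returns domain and name of every cookie whose expiry timestamp is in the past.
# Session cookies (expiry 0) and comment lines are ignored.
def expired_cookies(text, now = Time.now)
  text.each_line.filter_map do |line|
    line = line.chomp.sub(/\A#HttpOnly_/, "")
    next if line.empty? || line.start_with?("#")

    domain, _subdomains, _path, _secure, expires, name, _value = line.split("\t", 7)
    next if name.nil? # not a complete cookie line

    expiry = expires.to_i
    next if expiry.zero?

    { domain: domain, name: name } if expiry < now.to_i
  end
end
```

Running `expired_cookies(File.read("cookies.txt"))` and getting an empty array back would suggest the problem lies elsewhere, for example in which cookies or domains the script actually loads.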
