Importing / migrating mailing lists (mbox, Listserv, Google Groups, emails, ...)

I’m trying to import a standard mbox dump from a mailing list but running into “Process killed” issues, typically after a long time spent on the “indexing […]mbox” step. This is from a large mbox file from an open-source project with ten years of posts.

Things I’ve tried:

  • Splitting the mbox file into chunks. This did work partially, and successfully imported many posts, but I’m now stuck on the indexing of one of these chunks. I tried to divide that file into chunks too, the first of which imported eventually, and the second of which now seems stalled.

  • Increasing the memory available on our server. Memory usage slowly increases during indexing and currently plateaus at around 16 GB (out of 32 GB) for an attempted import of one of these chunks, an 80 MB mbox file.

During this time, 1 CPU continues to be maxed out.
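
For reference, the chunking mentioned above can be done along these lines. This is only a minimal sketch: the method name `split_mbox` and the chunk file names are made up for illustration, and it assumes every message starts at a line beginning with "From ", so a body containing an unescaped "From " at the start of a line would be counted as a new message.

```ruby
# Split an mbox file into chunks of at most `max_messages` messages each.
# Assumption: a new message starts at every line beginning with "From ".
# The method name and chunk naming scheme are hypothetical.
def split_mbox(path, max_messages:, out_dir: File.dirname(path))
  base = File.basename(path, ".mbox")
  chunk_paths = []
  out = nil
  count = 0

  File.open(path, "rb") do |input|
    input.each_line do |line|
      if line.start_with?("From ")
        # Open a new chunk for the first message and whenever the current one is full.
        if out.nil? || count >= max_messages
          out&.close
          chunk_path = File.join(out_dir, format("%s_chunk_%03d.mbox", base, chunk_paths.size + 1))
          chunk_paths << chunk_path
          out = File.open(chunk_path, "wb")
          count = 0
        end
        count += 1
      end
      # Lines before the first "From " separator (if any) are dropped.
      out&.write(line)
    end
  end

  chunk_paths
ensure
  out&.close
end
```

Calling `split_mbox("list.mbox", max_messages: 1000)` would write `list_chunk_001.mbox`, `list_chunk_002.mbox`, and so on next to the input file and return their paths.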

Any advice would be much appreciated, in particular on how to increase the verbosity of the debug output in case it’s getting stuck on a particular post. The index.db file in the import folder is about 800 MB.

I’m new to Ruby and do not regularly use SQL, so I’m finding it difficult to figure out what’s going on. Also this 32 GB server is expensive so I’d like to resize it back to 4 GB soon :slight_smile:

Thanks for any help!

I guess the parser hangs on one particular email in that mbox. The index.db is an SQLite database. Take a look at the email table, filter by the mbox filename in the filename column and find the highest value within the last_line_number column. It’s highly likely that the parser hangs at the next email after that line number within the mbox file.
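
Once you have the highest last_line_number from index.db, something like the following could locate the next email in the mbox file. It is an untested sketch: `next_message_range` is a hypothetical helper, and it assumes that line numbers are 1-based and that each message starts at a line beginning with "From ".

```ruby
# Return [first_line, last_line] (1-based, inclusive) of the first message that
# starts after `last_indexed_line`, or nil if no further message exists.
# Assumption: messages are separated by lines starting with "From ".
def next_message_range(mbox_path, last_indexed_line)
  first = nil
  total = 0

  File.open(mbox_path, "rb") do |input|
    input.each_line.with_index(1) do |line, number|
      total = number
      next if number <= last_indexed_line
      next unless line.start_with?("From ")

      # A second separator marks the end of the message we are looking for.
      return [first, number - 1] if first
      first = number
    end
  end

  # The candidate message runs to the end of the file (or there is none).
  first ? [first, total] : nil
end
```

The returned range tells you which lines to inspect, or to cut out of the mbox file before re-running the indexer.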

Many thanks @gerhard, I’ve managed to identify the last successfully indexed email and the email immediately afterwards that (I assume) is causing the hang. However, there doesn’t seem to be anything exceptional about these emails to me. Is it ok if I send you these two example emails in a private message to see if anything stands out?

Sure, you can send me a PM. And try removing those emails from the mbox file and test if the indexing works.

Thanks, sent. I wasn’t able to PM you directly, so I sent it via the team group; I hope this is OK. I will try removing the email too and see how far we get before another hang.

Thanks for the emails. I didn’t see anything out of the ordinary and in my tests it worked without problems.

This is untested, but you could try applying the following git patch before running the import script. It adds a 60-second timeout for parsing an email. That might help you find the culprit and move on if it affects only a couple of messages.

From 92efb4fc68724cfa20d5de48ba33b99c126a3a08 Mon Sep 17 00:00:00 2001
From: Gerhard Schlager
Date: Fri, 2 Oct 2020 17:27:39 +0200
Subject: [PATCH] Add timeout for parsing email in mbox importer

---
 script/import_scripts/mbox/support/indexer.rb | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/script/import_scripts/mbox/support/indexer.rb b/script/import_scripts/mbox/support/indexer.rb
index dc6e092c29..01523dad13 100644
--- a/script/import_scripts/mbox/support/indexer.rb
+++ b/script/import_scripts/mbox/support/indexer.rb
@@ -65,11 +65,15 @@ module ImportScripts::Mbox
     def index_emails(directory, category_name)
       all_messages(directory, category_name) do |receiver, filename, opts|
         begin
-          msg_id = receiver.message_id
-          parsed_email = receiver.mail
-          from_email, from_display_name = receiver.parse_from_field(parsed_email)
-          body, elided, format = receiver.select_body
-          reply_message_ids = extract_reply_message_ids(parsed_email)
+          msg_id = parsed_email = from_email = from_display_name = body = elided = format = reply_message_ids = nil
+
+          Timeout.timeout(60) do
+            msg_id = receiver.message_id
+            parsed_email = receiver.mail
+            from_email, from_display_name = receiver.parse_from_field(parsed_email)
+            body, elided, format = receiver.select_body
+            reply_message_ids = extract_reply_message_ids(parsed_email)
+          end
 
           email = {
             msg_id: msg_id,
-- 
2.28.0
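
To illustrate the mechanism the patch relies on: `Timeout.timeout` raises `Timeout::Error` (with the message "execution expired") when its block runs too long, and a surrounding `rescue` can log the failure and carry on with the next item. The helper `parse_with_timeout` below is a hypothetical stand-in, not the importer's code.

```ruby
require "timeout"

# Run the given block for each item, giving up on any item that takes longer
# than `seconds`. Returns the successful results and the skipped items.
# Hypothetical helper that only demonstrates the skip-on-timeout pattern.
def parse_with_timeout(items, seconds:)
  parsed = []
  skipped = []

  items.each do |item|
    begin
      Timeout.timeout(seconds) { parsed << yield(item) }
    rescue Timeout::Error => e
      # Record the slow item and continue with the next one.
      skipped << item
      warn "Skipping #{item.inspect}: #{e.message}"
    end
  end

  [parsed, skipped]
end
```

One caveat: `Timeout.timeout` works by raising an exception in the running thread, so it may not be able to interrupt code that is stuck inside a native extension.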

Many thanks @gerhard, your patch is working like a dream. For my purposes, skipping the bad messages is okay since there are only a few of them. However, we do now have additional output, in case it’s helpful for solving the issue or making the importer script more robust:

Failed to index message in /shared/import/data/lammps-users/chunk_10.mbox at lines 726814-729353
execution expired
["/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogumbo-2.0.2/lib/nokogumbo/html5.rb:243:in `escape_text'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogumbo-2.0.2/lib/nokogumbo/html5.rb:214:in `serialize_node_internal'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogumbo-2.0.2/lib/nokogumbo/html5/node.rb:58:in `write_to'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogiri-1.10.10/lib/nokogiri/xml/node.rb:699:in `serialize'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogiri-1.10.10/lib/nokogiri/xml/node.rb:855:in `to_format'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogiri-1.10.10/lib/nokogiri/xml/node.rb:711:in `to_html'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogumbo-2.0.2/lib/nokogumbo/html5/node.rb:28:in `block in inner_html'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogiri-1.10.10/lib/nokogiri/xml/node_set.rb:238:in `block in each'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogiri-1.10.10/lib/nokogiri/xml/node_set.rb:237:in `upto'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogiri-1.10.10/lib/nokogiri/xml/node_set.rb:237:in `each'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogumbo-2.0.2/lib/nokogumbo/html5/node.rb:28:in `map'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogumbo-2.0.2/lib/nokogumbo/html5/node.rb:28:in `inner_html'",
"/var/www/discourse/lib/html_to_markdown.rb:74:in `block (2 levels) in hoist_line_breaks!'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogiri-1.10.10/lib/nokogiri/xml/node_set.rb:238:in `block in each'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogiri-1.10.10/lib/nokogiri/xml/node_set.rb:237:in `upto'",
"/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/nokogiri-1.10.10/lib/nokogiri/xml/node_set.rb:237:in `each'",
"/var/www/discourse/lib/html_to_markdown.rb:57:in `block in hoist_line_breaks!'",
"/var/www/discourse/lib/html_to_markdown.rb:54:in `loop'",
"/var/www/discourse/lib/html_to_markdown.rb:54:in `hoist_line_breaks!'",
"/var/www/discourse/lib/html_to_markdown.rb:16:in `initialize'",
"/var/www/discourse/lib/email/receiver.rb:387:in `new'",
"/var/www/discourse/lib/email/receiver.rb:387:in `select_body'",
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:74:in `block (2 levels) in index_emails'", 
"/usr/local/lib/ruby/2.6.0/timeout.rb:108:in `timeout'",
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:70:in `block in index_emails'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:139:in `block (2 levels) in all_messages'",
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:171:in `block in each_mail'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:190:in `block in each_line'",
 "/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:189:in `each_line'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:189:in `each_line'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:166:in `each_mail'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:132:in `block in all_messages'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:125:in `foreach'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:125:in `all_messages'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:66:in `index_emails'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:25:in `block in execute'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:22:in `each'", 
"/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:22:in `execute'", 
"/var/www/discourse/script/import_scripts/mbox/importer.rb:43:in `index_messages'", 
"/var/www/discourse/script/import_scripts/mbox/importer.rb:27:in `execute'", 
"/var/www/discourse/script/import_scripts/base.rb:47:in `perform'", 
"script/import_scripts/mbox.rb:12:in `<module:Mbox>'", 
"script/import_scripts/mbox.rb:10:in `<module:ImportScripts>'", 
"script/import_scripts/mbox.rb:9:in `<main>'"]

As previously, I can share the specific message if it’s helpful – this time the error message gives me the specific line numbers so we can at least have high confidence that we’ve identified the correct message.
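
In case it's useful to others, here is a rough sketch of pulling a reported line range out of an mbox file into its own file for inspection. `extract_lines` and the output file name are hypothetical, and I'm assuming the line numbers in the "Failed to index message" output are 1-based and inclusive.

```ruby
# Copy lines first_line..last_line (assumed 1-based, inclusive) of `src` into
# `dest` and return the number of lines copied. Hypothetical helper.
def extract_lines(src, dest, first_line, last_line)
  copied = 0

  File.open(src, "rb") do |input|
    File.open(dest, "wb") do |output|
      input.each_line.with_index(1) do |line, number|
        next if number < first_line
        break if number > last_line

        output.write(line)
        copied += 1
      end
    end
  end

  copied
end

# Example call with the values from the output above (output name is made up):
# extract_lines("/shared/import/data/lammps-users/chunk_10.mbox",
#               "suspect_message.mbox", 726814, 729353)
```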

Sure, please share the messages with me and I will take a look. If there’s a problem we can fix, it will improve not only the importer but Discourse itself, as it uses the same parser for incoming emails.

I’ve been running this script daily for the past several months for a site that really needs to switch to subscribing the category to the group, but hasn’t done so yet. It works fine except that every so often I need to get a new cookies.txt file. About a month ago, something happened and it started complaining that “It looks like you do not have permissions to see email addresses. Aborting.” I did . . . something . . . and it started working again. Just over a week ago it happened again, and I’ve re-downloaded the cookies with multiple browsers/cookie plugins and keep getting the no-email version of the posts. I can see the addresses when logged in with the web browser.

Has anyone else had trouble lately? Any ideas on what to do? I’ve tried playing around with what domains are in the add_cookies call in the script, but that hasn’t helped.

Well, I’m looking at this again and it appears that links like

 https://groups.google.com/forum/message/raw?msg=GROUP_NAME/THREAD_ID/POST_ID

used to include the full email addresses, but now don’t. I can confirm that when I’m logged in, I can click around and see full email addresses in the Google Groups web interface, but if I open the above URL (the one the scrape script requests) in the very same web browser, I get the data with redacted email addresses.

My guess is that they’ve increased privacy or something.

Here’s another clue: I can open that link in my browser and it works, but if I snag the “copy as cURL” command, that curl command doesn’t get the email addresses. Sigh. Well, I tried with another browser and the curl command worked. I can’t quite figure out why the script isn’t getting the email addresses.

So maybe there’s some other browser-specific thing it’s doing?

I haven’t tried recently, so it’s possible that there have been changes that the scraper can’t handle right now.

@riking noticed that Google Takeout exports mbox files for group owners, so that might be an option to check out.

Thanks. Well, a week ago it worked for a second site and then I updated my cookies file one more time and it downloaded the data for the first site. It seems to have only worked for a day or two, and now again it’s not working. For either site. I see the full email address in my browser, download the cookies for that tab, and no joy.

I’ll check out Takeout. EDIT: Well, to get the mbox file it looks like you need to be a super administrator, not just an owner.

A command line tool to convert a mailman2 mailing list (i.e. the content of the config.pck with options, members, moderators, private or public flags etc.) into a discourse category is available here: mailman2discourse · PyPI

@gerhard any ideas how to adapt these instructions to using a dev install rather than the standard installation? I feel like I’ve come close to getting a listserv migration working using just a few commands, but I can’t get what I assume is the last step to work using either of:

ruby /src/script/import_scripts/mbox.rb ~/import/settings.yml
bundle exec ruby /src/script/import_scripts/mbox.rb /home/discourse/import/settings.yml

Both fail to pull in all dependencies. See here for the full set of commands I used and the errors. Any ideas? Missing some d/bundle calls perhaps?

Next thing I will try is using an Ubuntu VM and doing a “standard install” there but this seems a bit overkill given the dev install otherwise works quite nicely.

I am a total discourse (and ruby, and mostly docker) newbie, so sorry if this is obvious or (worse) irrelevant!

It looks like you figured out most of it.

I’ve never tried this with the Docker based dev install, but I guess you need to add "gem 'sqlite3'" to the Gemfile and execute apt install -y libsqlite3-dev inside the container before running d/bundle install

Afterwards bundle exec ruby ... should work.

@gerhard thanks for the gentle nudge – I re-ran from absolute scratch (git clone onward) while adding gem 'sqlite3' to the end of /src/Gemfile, since I assumed this was the one you meant, and it worked! For the record here are the instructions I used (for mne_analysis listserv):

1. In Ubuntu host

git clone https://github.com/discourse/discourse.git
cd discourse
d/boot_dev --init
d/rails db:migrate RAILS_ENV=development
d/shell
vim /src/Gemfile  # add gem 'sqlite3' to end
exit
d/bundle

2. In docker shell

sudo mkdir -p /shared/import/data
sudo chown -R discourse:discourse /shared/import
wget -r -l1 --no-parent --no-directories "https://mail.nmr.mgh.harvard.edu/pipermail//mne_analysis/" -P /shared/import/data/mne_analysis -A "*-*.txt.gz"
rm /shared/import/data/mne_analysis/robots.txt.tmp
gzip -d /shared/import/data/mne_analysis/*.txt.gz
wget https://gist.githubusercontent.com/larsoner/940cd6c7100b87c4c5668cb0bc540afb/raw/9e78513620d11355ad0e10f4a2470996c26ebc8c/mailmanToMBox.py -O ~/mailmanToMBox.py
python3 ~/mailmanToMBox.py /shared/import/data/mne_analysis/
rm /shared/import/data/mne_analysis/*.txt
sudo apt install -y libsqlite3-dev  # no-op for me

# check results
cat /shared/import/data/mne_analysis/*.mbox > ~/all.mbox
sudo apt install -y procmail
mkdir -p ~/split
export FILENO=0000
formail -ds sh -c 'cat > ~/split/msg.$FILENO' < ~/all.mbox
rm -rf ~/split ~/all.mbox

# settings
wget https://raw.githubusercontent.com/discourse/discourse/master/script/import_scripts/mbox/settings.yml -O /shared/import/settings.yml

# run it
cd /src
bundle exec ruby script/import_scripts/mbox.rb /shared/import/settings.yml

This had a bunch of informative output, and at the end:

...
Updating featured topics in categories
        5 / 5 (100.0%)  [6890 items/min]   ]  
Resetting topic counters


Done (00h 06min 21sec)

Then exiting and on the Ubuntu host:

d/unicorn &
google-chrome http://0.0.0.0:9292

Done!

I’ll probably tweak the settings to get rid of the [Mne_analysis] prefix, but I’m thrilled it’s working this well already!

@gerhard Can your mbox importer be used only when you first install Discourse, or can it also be used later, once other users are already using the site? If the importer runs while others are using Discourse, will they see any side effects?

In order to get the importer to scrape messages from Google Groups, I had to reverse this change in /script/import_scripts/google_groups.rb

I put the line

    wait_for_url { |url| url.start_with?("https://accounts.google.com") }

back to

    wait_for_url { |url| url.start_with?("https://myaccount.google.com") }

Otherwise, I would get this message every time:

Logging in...
Failed to login. Please check the content of your cookies.txt

@gerhard I noticed after the import that although the messages look okay, there are no staged users at all, even though it seems like there should be (I used the default staged: true). The outputs look like:

...
indexing replies and users

creating categories
        1 / 1 (100.0%)  [13440860 items/min]  
creating users

creating topics and posts
     7399 / 7399 (100.0%)  [1421 items/min]     
...

Is there supposed to be a user counter shown as well?

I also tried running with staged: false and the same output was shown, and none of the mailing list users are in any groups. In case it helps to see what’s actually being processed, here is one of the many .mbox files that is being imported:

2020-December.zip (49.5 KB)

The only non-default setting was adding:

tags:
  "Mne_analysis": "mne_analysis"

It would be great to have these users show up as staged so that they can claim their old posts when signing up, so any tips or ideas appreciated!

It should probably just accept both of those?
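
A minimal sketch of what accepting both could look like, relying on the fact that Ruby's `String#start_with?` takes several prefixes. The constant and the helper `logged_in_url?` are hypothetical names; only the two URL prefixes and `wait_for_url` come from the posts above, and whether accepting both is safe depends on where the login flow actually lands, which hasn't been established here.

```ruby
# Both URL prefixes mentioned in this thread; names below are hypothetical.
LOGIN_SUCCESS_PREFIXES = [
  "https://accounts.google.com",
  "https://myaccount.google.com",
].freeze

# True if the URL begins with either of the known prefixes.
def logged_in_url?(url)
  url.start_with?(*LOGIN_SUCCESS_PREFIXES)
end

# In the script, the check could then read:
#   wait_for_url { |url| logged_in_url?(url) }
```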