Experimental mbox importer fails on subset of messages


(M K) #1

I am using the experimental mbox importer to import a folder consisting of split-up mbox files created using:

formail -k -X Content-Type: -X Message-ID: -X Date: -X From: -X Subject: -X In-Reply-To: -ds sh -c 'cat > split/msg.$FILENO' < 1997

This is the output:

discourse@vultr:~/discourse/script/import_scripts$ bundle exec ruby mbox-experimental.rb ./mbox/settings.yml
loading existing groups...
loading existing users...
loading existing categories...
loading existing posts...
loading existing topics...

creating index
indexing files in /home/discourse/single/split
indexing /home/discourse/single/split/msg.1497
indexing /home/discourse/single/split/msg.3896
indexing /home/discourse/single/split/msg.1426
indexing /home/discourse/single/split/msg.3339
indexing /home/discourse/single/split/msg.2750

It freezes on msg.2750. If I remove this message from the folder and restart the importer, indexing resumes until it freezes on another message. I have included a link to an archived folder containing a collection of such problematic messages.

Do these files have anything in common – do they either contain something that they shouldn’t, or lack a header field that they should have?

This is the output when I press Ctrl + C after the indexing has halted:

/home/discourse/.rbenv/versions/2.3.4/lib/ruby/gems/2.3.0/gems/email_reply_trimmer-0.1.7/lib/email_reply_trimmer.rb:182:in `gsub!': Interrupt
        from /home/discourse/.rbenv/versions/2.3.4/lib/ruby/gems/2.3.0/gems/email_reply_trimmer-0.1.7/lib/email_reply_trimmer.rb:182:in `block in preprocess!'
        from /home/discourse/.rbenv/versions/2.3.4/lib/ruby/gems/2.3.0/gems/email_reply_trimmer-0.1.7/lib/email_reply_trimmer.rb:181:in `each'
        from /home/discourse/.rbenv/versions/2.3.4/lib/ruby/gems/2.3.0/gems/email_reply_trimmer-0.1.7/lib/email_reply_trimmer.rb:181:in `preprocess!'
        from /home/discourse/.rbenv/versions/2.3.4/lib/ruby/gems/2.3.0/gems/email_reply_trimmer-0.1.7/lib/email_reply_trimmer.rb:33:in `trim'
        from /home/discourse/discourse/lib/email/receiver.rb:205:in `select_body'
        from /home/discourse/discourse/script/import_scripts/mbox/support/indexer.rb:61:in `block in index_emails'
        from /home/discourse/discourse/script/import_scripts/mbox/support/indexer.rb:104:in `block (2 levels) in all_messages'
        from /home/discourse/discourse/script/import_scripts/mbox/support/indexer.rb:143:in `each_mail'
        from /home/discourse/discourse/script/import_scripts/mbox/support/indexer.rb:103:in `block in all_messages'
        from /home/discourse/discourse/script/import_scripts/mbox/support/indexer.rb:96:in `foreach'
        from /home/discourse/discourse/script/import_scripts/mbox/support/indexer.rb:96:in `all_messages'
        from /home/discourse/discourse/script/import_scripts/mbox/support/indexer.rb:57:in `index_emails'
        from /home/discourse/discourse/script/import_scripts/mbox/support/indexer.rb:23:in `block in execute'
        from /home/discourse/discourse/script/import_scripts/mbox/support/indexer.rb:20:in `each'
        from /home/discourse/discourse/script/import_scripts/mbox/support/indexer.rb:20:in `execute'
        from /home/discourse/discourse/script/import_scripts/mbox/importer.rb:34:in `index_messages'
        from /home/discourse/discourse/script/import_scripts/mbox/importer.rb:25:in `execute'
        from /home/discourse/discourse/script/import_scripts/base.rb:45:in `perform'
        from mbox-experimental.rb:14:in `<module:Mbox>'
        from mbox-experimental.rb:8:in `<module:ImportScripts>'
        from mbox-experimental.rb:7:in `<main>'

mboxfails.tar.gz (52.5 KB)
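
Rather than removing stuck messages one at a time, it may be possible to find them all in one pass by putting a time limit on each file. The following is only a sketch: `find_stuck_files` is a hypothetical helper, not part of the importer, and the block passed to it stands in for whatever per-file work hangs. It relies on Ruby's standard `Timeout` module, and it assumes the hanging call can be interrupted, which the `Interrupt` raised inside `gsub!` above suggests is the case.

```ruby
require "timeout"

# Hypothetical helper (not part of Discourse or the mbox importer).
# Runs the given block once per path and returns the paths for which
# the block did not finish within `seconds`.
def find_stuck_files(paths, seconds)
  stuck = []
  paths.each do |path|
    begin
      # Timeout.timeout raises Timeout::Error if the block runs too long.
      Timeout.timeout(seconds) { yield path }
    rescue Timeout::Error
      stuck << path
    end
  end
  stuck
end
```

One untested way to use it would be to pass the files in the split folder and, in the block, read each file and hand its text to `EmailReplyTrimmer.trim`, the method the trace shows running when the hang occurs. Raw file contents are not exactly what the importer passes in, so treat the result as a rough filter rather than a definitive list.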


(Gerhard Schlager) #2

Well, I named it “experimental” for a reason :wink:
I’m not sure why the indexing freezes on those files. They look fine at first glance…

Thanks for the files. I expect I’ll work on the importer a little bit more soon. I’ll look into it then.