Mbox-experimental.rb import crashes with no implicit conversion of Regexp into Integer


(Yaw Anokwa) #1

I recently used mbox-experimental.rb (v1.9.0.beta1 +35) and it worked amazingly well. It’s a big improvement over the existing mbox.rb, so thanks to @gerhard for the great work there!

I only had one problem with the script, and I’ve posted the stack trace and email message below. Removing the message allowed the script to continue, but it’d be nice if there was no crash (or the message was ignored, or the user was asked whether to stop or skip this message).

Stack trace

/var/www/discourse/lib/email/receiver.rb:274:in `[]': no implicit conversion of Regexp into Integer (TypeError)
        from /var/www/discourse/lib/email/receiver.rb:274:in `parse_from_field'
        from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:60:in `block in index_emails'
        from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:104:in `block (2 levels) in all_messages'
        from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:143:in `each_mail'
        from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:103:in `block in all_messages'
        from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:96:in `foreach'
        from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:96:in `all_messages'
        from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:57:in `index_emails'
        from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:23:in `block in execute'
        from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:20:in `each'
        from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:20:in `execute'
        from /var/www/discourse/script/import_scripts/mbox/importer.rb:34:in `index_messages'
        from /var/www/discourse/script/import_scripts/mbox/importer.rb:25:in `execute'
        from /var/www/discourse/script/import_scripts/base.rb:45:in `perform'
        from mbox-experimental.rb:14:in `<module:Mbox>'
        from mbox-experimental.rb:8:in `<module:ImportScripts>'
        from mbox-experimental.rb:7:in `<main>'

Email message

X-Received: by 10.42.51.141 with SMTP id e13mr3117823icg.28.1394816204975;
        Fri, 14 Mar 2014 09:56:44 -0700 (PDT)
X-BeenThere: opendatakit@googlegroups.com
Received: by 10.182.55.73 with SMTP id q9ls345718obp.98.gmail; Fri, 14 Mar
 2014 09:56:42 -0700 (PDT)
X-Received: by 10.182.81.7 with SMTP id v7mr3655066obx.28.1394816202052;
        Fri, 14 Mar 2014 09:56:42 -0700 (PDT)
Received: by 10.50.6.16 with SMTP id w16msigw;
        Thu, 13 Mar 2014 14:21:09 -0700 (PDT)
X-Received: by 10.182.24.134 with SMTP id u6mr1633212obf.24.1394745669261;
        Thu, 13 Mar 2014 14:21:09 -0700 (PDT)
Return-Path: <quqapwgm@kgbgj.net>
Received: from kgbgj.net ([219.82.53.150])
        by gmr-mx.google.com with ESMTP id s1si1229168ign.1.2014.03.13.14.21.03
        for <opendatakit@googlegroups.com>;
        Thu, 13 Mar 2014 14:21:09 -0700 (PDT)
Received-SPF: neutral (google.com: 219.82.53.150 is neither permitted nor denied by best guess record for domain of quqapwgm@kgbgj.net) client-ip=219.82.53.150;
Authentication-Results: gmr-mx.google.com;
       spf=neutral (google.com: 219.82.53.150 is neither permitted nor denied by best guess record for domain of quqapwgm@kgbgj.net) smtp.mail=quqapwgm@kgbgj.net
Message-ID: <83D010B68028B8398B7E6F8BB8BF9A30@kgbgj.net>
From: =?utf-8?B?5bi46Z2S5rO9?=
To: <opendatakit@googlegroups.com>
Subject: =?utf-8?B?6YKT57uP55CG6ZmE5Lu26K+35p+l6ZiF77yb?=
Date: Fri, 14 Mar 2014 05:20:56 +0800
MIME-Version: 1.0
Content-Type: multipart/mixed;
	boundary="----=_NextPart_000_0423_01892B10.10703A20"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.5512
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.5512

------=_NextPart_000_0423_01892B10.10703A20
Content-Type: text/plain;
	charset="utf-8"
Content-Transfer-Encoding: base64

5b6u5L+h5pyq5p2l5bCG5pS55Y+Y55qE5YWt5aSn6KGM5Lia77ybDQoNCuS4gOOAgeW5v+WRiuS4
muKAlOKAlOaJvuWbnumCo+a1qui0ueeahDUwJeW5v+WRiui0uQ0K5LqM44CB55S15ZWG5Lia4oCU
4oCU55S15ZWG572R56uZ5LiN5piv5bmz5Y+w5piv6LSn5p62DQrkuInjgIHlh7rniYjkuJrigJTi
gJTmlbDlrZflh7rniYjliLDoh6rlh7rniYjnmoTov5vljJYNCuWbm+OAgeW9semZoumkkOmlruS4
muKAlOKAlOe6v+S4i+acjeWKoeS4mueahE8yTw0K5LqU44CB6aKE5LuY5Y2hL+S8muWRmOWNoeKA
lOKAlENSTei/mOWPr+S7pei/meagt+eOqQ0K5YWt44CB6ZO26KGM5Z+66YeR5Lia4oCU4oCU5LqS
6IGU572R6YeR6J6N55qE6K+V6aqM55SwDQo=
------=_NextPart_000_0423_01892B10.10703A20
Content-Type: application/octet-stream;
	name="=?utf-8?B?6YKT57uP55CG6ZmE5Lu26K+35p+l6ZiFLmRvYw==?="
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
	filename="=?utf-8?B?6YKT57uP55CG6ZmE5Lu26K+35p+l6ZiFLmRvYw==?="

0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAABAAAAOQAAAAAAAAAA
EAAAOwAAAAEAAAD+////AAAAADgAAAD/////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////s

lots and lots of other random text

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==

------=_NextPart_000_0423_01892B10.10703A20--

(Joffrey Jaffeux) #2

Your email’s From field is a base64-encoded UTF-8 string. That makes this regex fail, because it doesn’t expect this kind of format: discourse/receiver.rb at v1.9.0.beta1 · discourse/discourse · GitHub

Here is a quick Ruby snippet that decodes your email’s “From” field and might help with the issue. I don’t know the importer as a whole, so I’ll leave it for someone more involved:

require "base64"

encoded = "=?UTF-8?B?5bi46Z2S5rO9?="
regex = /\?utf-8\?B\?(.*)\?=/i
# match returns nil when the value is not an encoded word, hence the &.
captures = encoded.match(regex)&.captures

if captures
  text = captures[0]
  decoded = Base64.decode64(text)
  p decoded.force_encoding("utf-8")  # "常青泽"
end
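
The snippet above only handles the base64 ("B") form. Encoded words can also use the "Q" (quoted-printable style) form, so a slightly more general version might look like the sketch below. It uses only the standard library; `ENCODED_WORD` and `decode_encoded_words` are names made up for this illustration and are not part of Discourse or the importer.

```ruby
require "base64"

# Hypothetical helper, not Discourse code: matches encoded words such as
# =?utf-8?B?...?= or =?utf-8?Q?...?= inside a header value.
ENCODED_WORD = /=\?([^?]+)\?([bq])\?([^?]*)\?=/i

def decode_encoded_words(value)
  value.gsub(ENCODED_WORD) do
    charset, encoding, text = $1, $2, $3
    bytes =
      if encoding.upcase == "B"
        Base64.decode64(text)
      else
        # Q form: "_" stands for a space, "=XX" is a hex-encoded byte.
        text.tr("_", " ").unpack1("M")
      end
    # Label the raw bytes with the declared charset, then convert to UTF-8.
    bytes.force_encoding(charset).encode("UTF-8")
  end
end

puts decode_encoded_words("=?utf-8?B?5bi46Z2S5rO9?=")
puts decode_encoded_words("=?utf-8?Q?Charles_Mout=C3=A9?= <foo@gmail.com>")
```

Text outside the encoded word (such as a trailing address) is left untouched. An unknown charset name would still raise, so a real importer would need to rescue that as well.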

(Eli the Bearded) #3

Well, besides the MIME encoded word, the From line does not contain an email address, which I think is the real problem that was triggered here. Given that this is a Google Groups source message, I would guess that Usenet rules for From: headers apply. RFC1036 shows these three forms as the only acceptable ones:

[quote="RFC1036 page 3"]
```
From: mark@cbosgd.ATT.COM
From: mark@cbosgd.ATT.COM (Mark Horton)
From: Mark Horton <mark@cbosgd.ATT.COM>
```
[/quote]

This parser appears to only expect the third (and most modern) format.
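
For illustration, here is a rough sketch of a parser that accepts all three forms and returns nil when there is no address at all, instead of raising. `parse_usenet_from` is a hypothetical helper written for this post, not the method in receiver.rb, and the address patterns are deliberately loose.

```ruby
# Hypothetical helper, not the Discourse parser: returns [email, display_name]
# for the three RFC 1036 From forms, or nil when no address is present.
def parse_usenet_from(value)
  value = value.to_s.strip
  case value
  when /\A(.*?)\s*<([^<>\s]+@[^<>\s]+)>\z/       # Mark Horton <mark@host>
    [$2, $1.empty? ? nil : $1]
  when /\A([^\s()<>]+@[^\s()<>]+)\s*\((.*)\)\z/  # mark@host (Mark Horton)
    [$1, $2]
  when /\A([^\s()<>]+@[^\s()<>]+)\z/             # mark@host
    [$1, nil]
  end
end

p parse_usenet_from("mark@cbosgd.ATT.COM (Mark Horton)")
p parse_usenet_from("=?utf-8?B?5bi46Z2S5rO9?=")
```

The second call returns nil, which the caller can treat as "no usable sender" and handle however it likes.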

(Yaw Anokwa) #4

Elijay is correct! I grepped all the emails with grep -rh "^From: =?utf-8?" . and found this…

Import success

From: =?utf-8?Q?Charles_Mout=C3=A9?= <foo@gmail.com>
From: =?utf-8?B?5oi05ai8?= <bar@pub.xaonline.com>
From: =?utf-8?Q?Nguy=E1=BB=85n_V=C4=83n_Thanh?= <foo.bar@gmail.com>

Import failure

From: =?utf-8?B?5bi46Z2S5rO9?=

I noticed the thread was moved to support. Feels like a bug to me. Or rather, it should fail in a nicer way. Maybe the importer should ignore the message and move on?
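
To make the "ignore the message and move on" idea concrete, here is a minimal sketch of what I mean. `index_all` and the block passed to it are hypothetical stand-ins; this is not how indexer.rb is actually structured.

```ruby
# Hypothetical sketch, not the importer's code: runs the given block on each
# raw message, collecting failures instead of letting one message abort the run.
def index_all(raw_messages)
  indexed = []
  skipped = []
  raw_messages.each_with_index do |raw, i|
    begin
      indexed << yield(raw)
    rescue StandardError => e
      skipped << { index: i, error: "#{e.class}: #{e.message}" }
    end
  end
  [indexed, skipped]
end

from_lines = ["Charles <foo@gmail.com>", "=?utf-8?B?5bi46Z2S5rO9?="]
indexed, skipped = index_all(from_lines) do |line|
  # Stand-in parser: raises when the line has no address.
  line[/<([^>]+)>/, 1] or raise(TypeError, "no address in From line")
end

puts "indexed #{indexed.size}, skipped #{skipped.size}"
skipped.each { |s| warn "message #{s[:index]}: #{s[:error]}" }
```

The skipped list could be printed at the end of the run so the user can decide whether those messages matter.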