Import_mbox.sh not working with e-mails from Samsung phone sent via a listserv server

markcoley · January 20, 2022, 11:04am

A strange problem has come to light in my testing setup, where I am copying e-mails over to my Discourse server and running import_mbox.sh to incorporate them. The original e-mails are from a listserv mailing list.

I’ve found that if someone replies to a previous listserv e-mail from a Samsung phone and I then import the resulting e-mail into Discourse, the importer doesn’t extract the new content. Instead it posts a duplicate of the original e-mail, labelled as if the person who replied had written it.

If I copy/paste the raw e-mail that is problematic into the Emails/Advanced Test box the same issue is present. If I truncate the e-mail and strip out several Samsung-added parts it seems to work.

I can’t put copies of the e-mails that trigger this here as they are confidential. E-mails that don’t import have sections like this in them (and there is no human-readable content - it’s all in base64 coding):

[truncated headers here]
Content-Type: multipart/alternative;
	boundary="--_com.samsung.android.email_341310020171250"

----_com.samsung.android.email_341310020171250
Content-Transfer-Encoding: base64
Content-Type: text/plain; charset=UTF-8

VGhlIGxlZ2lzbGF0aW9uI[truncated]
[...]
[truncated]X19fX19fX19fX18NCg==
----_com.samsung.android.email_341310020171250
Content-Type: multipart/related;
	boundary="--_com.samsung.android.email_341310031317791"
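
Since nothing in these messages is human-readable, a small script can help with inspecting them before import. This is only a rough sketch using Ruby's standard library: `decode_base64_parts` is a made-up helper name, the sample message is invented, and it does not handle nested multipart sections like the multipart/related one above.

```ruby
require "base64"

# Hypothetical helper (not part of Discourse): split a raw multipart message
# on its boundary and decode every part that declares base64 encoding.
def decode_base64_parts(raw, boundary)
  delimiter = Regexp.escape("--#{boundary}")
  # drop(1) discards everything before the first boundary line (the headers).
  sections = raw.split(/^#{delimiter}(?:--)?[ \t]*(?:\r?\n|\z)/).drop(1)

  sections.filter_map do |section|
    headers, body = section.split(/\r?\n\r?\n/, 2)
    next if body.nil?
    next unless headers.match?(/^Content-Transfer-Encoding:\s*base64/i)

    content_type = headers[/^Content-Type:\s*([^;\s]+)/i, 1]
    [content_type, Base64.decode64(body)]
  end
end

# Invented sample in the same shape as the Samsung messages.
sample = <<~MAIL
  Content-Type: multipart/alternative;
    boundary="--_com.samsung.android.email_1"

  ----_com.samsung.android.email_1
  Content-Transfer-Encoding: base64
  Content-Type: text/plain; charset=UTF-8

  WWVz
  ----_com.samsung.android.email_1
  Content-Transfer-Encoding: base64
  Content-Type: text/html; charset=UTF-8

  PGRpdj5ZZXM8L2Rpdj4=
  ----_com.samsung.android.email_1--
MAIL

decode_base64_parts(sample, "--_com.samsung.android.email_1").each do |type, text|
  puts "#{type}: #{text}"
end
```

Note that the boundary value from the header already starts with two dashes, so the delimiter lines in the body start with four.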

pfaffman · January 20, 2022, 1:18pm

So you’ll need to modify import_mbox.sh to truncate the email and strip out the Samsung nonsense.

It could be an issue that could be resolved in core, as those messages probably fail when processed by emailing them in (but I haven’t looked at the code lately, so I don’t know). In any case, the most expedient solution will likely be to modify the import script for those messages.

Or maybe someone will recognize this as a problem in core and fix it.

markcoley · January 20, 2022, 1:56pm

Having done a bit more delving, it seems the Samsung mail app encodes a plain text part and an HTML part, each in base64. I’ve found that if I add a blank line between the two encodings, the mail filter works correctly. It may be that Samsung is not adding a blank line where it should, or it may be that the mail filter isn’t correctly locating the plain text and HTML parts, so that once it has found the HTML part it doesn’t work out where that part’s header finishes and the message content starts.

I’ve tried copying the original e-mail from Gmail (via view original) and also exporting the same message from Thunderbird, with the same results.

Samsung-generated e-mails seem to have this at the bottom of the headers:

Content-Type: multipart/alternative;
	boundary="--_com.samsung.android.email_396413402758380"

----_com.samsung.android.email_396413402758380
Content-Transfer-Encoding: base64
Content-Type: text/plain; charset=UTF-8

WWVz[base64 encoded plain text message goes here]

and this ends with

[more base64 encoded data here]19fDQo=
----_com.samsung.android.email_396413402758380
Content-Transfer-Encoding: base64
Content-Type: text/html; charset=UTF-8

PGh0b[base64 encoding again, this time encoding HTML version of the same message]

and this ends with

[more base64 encoded data]NCg==
----_com.samsung.android.email_396413402758380--

Now if I change the middle bit by adding a blank line (after the “email_396413402758380” bit), all works perfectly!

[more base64 encoded data here]19fDQo=
----_com.samsung.android.email_396413402758380

Content-Transfer-Encoding: base64
Content-Type: text/html; charset=UTF-8

PGh0b[base64 encoding again, this time encoding HTML version of the same message]

Does this suggest a bug in the importer?

pfaffman · January 20, 2022, 2:01pm

To me it suggests a bug in Samsung’s mailer. But I don’t know, and it doesn’t matter whose bug it is.

However, the easiest fix will be to add a gsub in the import script that adds the blank line you describe.

It’s probably easier to insert a line before Content-Transfer-Encoding: base64

Hopefully that won’t break anything else.

But Gerhard is writing a better answer… right now
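
For what it's worth, the gsub idea could look roughly like the sketch below. It is untested against Discourse itself: `add_blank_line_before_part_headers` is a made-up name, where it would be called in the import script is left open, and the pattern assumes boundaries always look like the Samsung ones quoted in this topic. It only touches a boundary line that directly follows a line of data (the case that was edited by hand above), so the first boundary after the message headers is left alone.

```ruby
# Hypothetical helper, not part of the Discourse import script.
# Matches a Samsung boundary line that directly follows a non-empty line and
# is directly followed by a "Content-Transfer-Encoding: base64" header.
SAMSUNG_PART_BOUNDARY =
  /([^\r\n]\r?\n----_com\.samsung\.android\.email_\d+)(\r?\n)(?=Content-Transfer-Encoding:\s*base64)/i

# Repeats the line ending after the boundary, which adds the blank line.
def add_blank_line_before_part_headers(raw)
  raw.gsub(SAMSUNG_PART_BOUNDARY, '\1\2\2')
end

# Invented sample in the same shape as the messages in this topic.
raw = <<~MAIL
  Content-Type: multipart/alternative;
    boundary="--_com.samsung.android.email_1"

  ----_com.samsung.android.email_1
  Content-Transfer-Encoding: base64
  Content-Type: text/plain; charset=UTF-8

  WWVz
  ----_com.samsung.android.email_1
  Content-Transfer-Encoding: base64
  Content-Type: text/html; charset=UTF-8

  PGRpdj5ZZXM8L2Rpdj4=
  ----_com.samsung.android.email_1--
MAIL

puts add_blank_line_before_part_headers(raw)
```

One caveat: in MIME a blank line normally ends a part's headers, so this changes how the HTML part is parsed. It reproduces the manual workaround and is not a proper fix.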

gerhard · January 20, 2022, 2:09pm

Well, in that case I’d say it’s either a bug in the mail gem we use for parsing emails or a bug in the Samsung app. After a quick glance at the RFCs I’d say it’s probably a bug in the parser.

Could you by any chance provide a full example of such a problematic email? Maybe you could ask one of the authors of your confidential emails to send you a non-confidential email?

markcoley · January 20, 2022, 3:36pm

I’ve tried to contrive an e-mail by decoding the base64, changing the wording, then re-encoding and have found something else interesting.

The removal of a space character part way through the original message can make it correctly extract the reply that was written above.

In this example, part way through the decoded HTML message, if I find a line containing a [space] before a closing </div> and remove it, so change

21  20:17  (GMT+00:00) </div><div>To: LIST@LISTS

to

21  20:17  (GMT+00:00)</div><div>To: LIST@LISTS

through the removal of the [space] character before the </div>, then re-encode to base64 and put it back in the message testing box in the admin settings, then the filter works.

I could post an e-mail via direct message if any help?
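
The decode, edit, re-encode round trip described above can be scripted along these lines. This is just a sketch with a made-up helper name (`strip_space_before_closing_div`), and it is broader than the manual test: it removes the whitespace before every closing </div>, not only the one that was edited by hand.

```ruby
require "base64"

# Hypothetical helper: decode a base64 HTML part, remove spaces or tabs that
# sit directly before a closing </div>, then re-encode in 76-character lines.
def strip_space_before_closing_div(encoded_html)
  html = Base64.decode64(encoded_html).force_encoding(Encoding::UTF_8)
  cleaned = html.gsub(/[ \t]+(?=<\/div>)/, "")
  Base64.strict_encode64(cleaned).scan(/.{1,76}/).join("\n")
end

# Invented fragment modelled on the line quoted above.
original = "<div>21  20:17  (GMT+00:00) </div><div>To: LIST@LISTS</div>"
fixed = strip_space_before_closing_div(Base64.strict_encode64(original))

puts Base64.decode64(fixed)
# prints: <div>21  20:17  (GMT+00:00)</div><div>To: LIST@LISTS</div>
```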

markcoley · January 20, 2022, 10:45pm

Here is a contrived e-mail I’ve made that I think demonstrates the problem. If you look at the HTML part it has a reply to an earlier message. The importer doesn’t seem to be able to see where the original message started.

From someone@gmail
Delivered-To: someone@gmail
Importance: Normal
MIME-Version: 1.0
Message-ID: <E1mt6gg-00H2OV-6N@relay01.mail.eu.clara.net>
Date: Fri, 3 Dec 2021 11:25:05 +0000
From: someone <someone@somewhere.net>
Subject: Re: Example e-mail
To: <LIST@LISTSERV>
In-Reply-To: <007301d7e834$c268a3e0$4739eba0$@sslmc.co.uk>
Precedence: list
Content-Type: multipart/alternative;
	boundary="--_com.samsung.android.email_7076959834053910"

----_com.samsung.android.email_7076959834053910
Content-Transfer-Encoding: base64
Content-Type: text/plain; charset=UTF-8

WWVzIGFuZCB3ZSBhcmUgZ2V0dGluZyBsb3RzIGluIEFCQyBhbmQgdXJnZW50IGNhcmUsIGFsb25n
IHdpdGggdmFjY2luYXRpb24gc2lkZSBlZmZlY3RzIGZyb20gZmx1ICsgYm9vc3Rlci4gQXBwYXJl
bnRseSBERUYxMTEgY2FuJ3QgZGVhbCB3aXRoIHRoaXMgc29ydCBvZiBxdWVyeS4gTG9va2luZyBn
b29kIGZvciBYbWFzIGFuZCBOWSB3ZWVrICAgIDIyMjIKClNhbQoKRHIgU2FtIFNtaXRoeSAKCgot
LS0tLS0tLSBPcmlnaW5hbCBtZXNzYWdlIC0tLS0tLS0tCkZyb206IFNvdXRoIFNvdXRocyBYWVog
PGVucXVpcnlAU1NYWVouQ08uVUs+CkRhdGU6IDAzLzEyLzIwMjEgMTA6NTkgKEdNVCswMDowMCkK
VG86IExJU1RTQExJU1RTRVJWLkFCQy5PUkcuVUsKU3ViamVjdDogZXZlcnl0aGluZyBsYW5kcyBi
YWNrIGF0IG91ciBkb29yIQoKUHJhY3RpY2VzIHJlcG9ydGluZyB0byBnZXQgIDIgb3IgMyBxdWVy
aWVzL2RheSBmcm9tIHBhdGllbnRzIHJlZ2FyZGluZyBmaXZlcyB2YWNjaW5lIGlzc3VlcyB3aG8g
aGF2ZSBiZWVuIHJlZmVycmVkIHRvIHRoZWlyIEdQIGJ5IFhZWiAxMjMgb3IgdGhlIE5CUy4gIFRo
ZXNlIHF1ZXJpZXMgaW5jbHVkZSBzY2hlZHVsaW5nIGRvc2VzIGZvciBpbW11bm9zdXBwcmVzc2Vk
IHB0cyBhcyB3ZWxsIGFzIHBoYXJtYWNldXRpY2FsIHF1ZXN0aW9ucy4gIA==
----_com.samsung.android.email_7076959834053910
Content-Transfer-Encoding: base64
Content-Type: text/html; charset=UTF-8

PGh0bWw+PGhlYWQ+PG1ldGEgaHR0cC1lcXVpdj0iQ29udGVudC1UeXBlIiBjb250ZW50PSJ0ZXh0
L2h0bWw7IGNoYXJzZXQ9VVRGLTgiPjwvaGVhZD48Ym9keSBkaXI9ImF1dG8iPjxkaXYgZGlyPSJh
dXRvIj5ZZXMgYW5kIHdlIGFyZSBnZXR0aW5nIGxvdHMgaW4gQUJDIGFuZCB1cmdlbnQgY2FyZSwg
YWxvbmcgd2l0aCB2YWNjaW5hdGlvbiBzaWRlIGVmZmVjdHMgZnJvbSBmbHUgKyBib29zdGVyLiBB
cHBhcmVudGx5IERFRjExMSBjYW4ndCBkZWFsIHdpdGggdGhpcyBzb3J0IG9mIHF1ZXJ5LiBMb29r
aW5nIGdvb2QgZm9yIFhtYXMgYW5kIE5ZIHdlZWsmbmJzcDsgJm5ic3A7IDIyMjI8L2Rpdj48ZGl2
IGRpcj0iYXV0byI+PGJyPjwvZGl2PjxkaXYgZGlyPSJhdXRvIj5TYW08L2Rpdj48ZGl2IGRpcj0i
YXV0byI+PGJyPjwvZGl2PjxkaXYgaWQ9ImNvbXBvc2VyX3NpZ25hdHVyZSIgZGlyPSJhdXRvIj48
bWV0YSBodHRwLWVxdWl2PSJDb250ZW50LVR5cGUiIGNvbnRlbnQ9InRleHQvaHRtbDsgY2hhcnNl
dD1VVEYtOCI+RHIgU2FtIFNtaXRoeSZuYnNwOzwvZGl2PjxkaXYgZGlyPSJhdXRvIj48YnI+PC9k
aXY+PGRpdj48YnI+PC9kaXY+PGRpdiBhbGlnbj0ibGVmdCIgZGlyPSJhdXRvIiBzdHlsZT0iZm9u
dC1zaXplOjEwMCU7Y29sb3I6IzAwMDAwMCI+PGRpdj4tLS0tLS0tLSBPcmlnaW5hbCBtZXNzYWdl
IC0tLS0tLS0tPC9kaXY+PGRpdj5Gcm9tOiBTb3V0aCBTb3V0aHMgWFlaICZsdDtlbnF1aXJ5QFNT
WFlaLkNPLlVLJmd0OyA8L2Rpdj48ZGl2PkRhdGU6IDAzLzEyLzIwMjEgIDEwOjU5ICAoR01UKzAw
OjAwKSA8L2Rpdj48ZGl2PlRvOiBMSVNUU0BMSVNUU0VSVi5BQkMuT1JHLlVLIDwvZGl2PjxkaXY+
U3ViamVjdDogZXZlcnl0aGluZyBsYW5kcyBiYWNrIGF0IG91ciBkb29yISA8L2Rpdj48ZGl2Pjxi
cj48L2Rpdj48L2Rpdj48ZGl2IGNsYXNzPSJXb3JkU2VjdGlvbjEiPjxwIGNsYXNzPSJNc29Ob3Jt
YWwiPjxzcGFuIHN0eWxlPSJmb250LWZhbWlseTomcXVvdDtBcmlhbCZxdW90OywmcXVvdDtzYW5z
LXNlcmlmJnF1b3Q7Ij5QcmFjdGljZXMgcmVwb3J0aW5nIHRvIGdldCAmbmJzcDsyIG9yIDMgcXVl
cmllcy9kYXkgZnJvbSBwYXRpZW50cyByZWdhcmRpbmcgZml2ZXMgdmFjY2luZSBpc3N1ZXMgd2hv
IGhhdmUgYmVlbiByZWZlcnJlZCB0byB0aGVpciBHUCBieSBYWVogMTIzIG9yIHRoZSBOQlMuJm5i
c3A7IFRoZXNlIHF1ZXJpZXMgaW5jbHVkZSBzY2hlZHVsaW5nIGRvc2VzIGZvciBpbW11bm9zdXBw
cmVzc2VkIHB0cyBhcyB3ZWxsIGFzIHBoYXJtYWNldXRpY2FsIHF1ZXN0aW9ucy4mbmJzcDsgPC9z
cGFuPjwvcD48L2Rpdj48L2JvZHk+PC9odG1sPg==
----_com.samsung.android.email_7076959834053910--
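
To illustrate what "seeing where the original message started" means for the plain text part, here is a minimal sketch that cuts the decoded text at the "-------- Original message --------" line. This is an assumption for illustration only, not how Discourse trims replies: `reply_only` and the marker pattern are made up, the sample text is invented, and the HTML part (where the marker sits inside a div) is not handled.

```ruby
# Hypothetical marker pattern, based only on the Samsung example in this topic.
ORIGINAL_MESSAGE_MARKER = /^-{2,}\s*Original message\s*-{2,}\s*$/i

# Returns the text before the quoted original, or the whole text (minus
# trailing whitespace) when no marker line is present.
def reply_only(plain_text)
  plain_text.split(ORIGINAL_MESSAGE_MARKER, 2).first.to_s.rstrip
end

# Invented text in the same layout as the decoded plain text part.
decoded = <<~TEXT
  Yes, agreed.

  Sam

  -------- Original message --------
  From: someone <someone@somewhere.net>
  Subject: Example e-mail

  Text of the earlier message.
TEXT

puts reply_only(decoded)
```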

markcoley · May 7, 2022, 9:11pm

This problem seems to affect messages from other mail clients too, I’m now discovering. I can’t post in public the e-mails that generate the faults for everyone to look at, but would be happy to let someone see them privately.

My current setup is that I have installed Discourse on a home server, and e-mails to a listserv mailing list are sent to me (which goes to a Gmail account). If a ‘To:’ filter matches the name of the mailing list, I have set Gmail to forward a copy of the e-mail to mailinglist@mydiscoursedomain.org.uk. Discourse has a category set to mirror a mailing list that looks for this e-mail.

The same issue occurs if I use the import_mbox.sh script too, having manually copied e-mails over, so it must be the part of the code that looks for the new part of the message that is getting confused.

markcoley · May 9, 2022, 10:09am

Is there any way to make Discourse whizz through all the previously imported e-mails and try reformatting them using the plain text part of the original e-mails, in case that is a temporary fix to the above problem? Before import it was set to use the HTML part. From peeking around using ‘rails c’ I can see each post seems to have the full text of the incoming message stored (including e-mail headers). I’ve tried running ‘rake posts:rebuild’ after turning off the HTML option, and whilst it plods through all the messages slowly, I’m not sure if anything changed. For example, I tried turning the show trimmed content option on and off too, but the little box with three dots still seems to be there on posts after the rake has finished.
