A strange problem has come to light in my testing setup, where I copy e-mails over to my Discourse server and run import_mbox.sh to import them. The original e-mails are from a listserv mailing list.
I’ve found that when someone replies to a previous listserv e-mail from a Samsung phone and I import the resulting e-mail into Discourse, the new content isn’t extracted. Instead the post is a duplicate of the original e-mail, labelled as if the person who replied had written it.
If I copy/paste the problematic raw e-mail into the Emails/Advanced Test box, the same issue is present. If I truncate the e-mail and strip out several Samsung-added parts, it seems to work.
I can’t put copies of the e-mails that trigger this here as they are confidential. E-mails that don’t import have sections like this in them (and there is no human-readable content - it’s all in base64 coding):
So you’ll need to modify import_mbox.sh to truncate the email and strip out the Samsung nonsense.
It could be an issue that could be resolved in core, as those messages probably fail when processed by emailing them in (but I haven’t looked at the code lately, so I don’t know). In any case, the most expedient solution will likely be to modify the import script for those messages.
Or maybe someone will recognize this as a problem in core and fix it.
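As a rough illustration of the "modify the import" route, here is a minimal Python sketch that pre-processes an mbox file before it is handed to import_mbox.sh, rather than changing the script itself. It assumes the affected messages can be recognised by the `com.samsung.android.email` text in their MIME boundary, and that keeping only the plain text part is acceptable. Both are assumptions, the function names are hypothetical, and whether the plain text part then imports cleanly has not been checked here.

```python
import mailbox

# Marker taken from the boundary quoted in this thread; other clients differ.
MARKER = "com.samsung.android.email"

def first_plain_part(msg):
    """Return the first non-multipart text/plain part, or None."""
    for part in msg.walk():
        if not part.is_multipart() and part.get_content_type() == "text/plain":
            return part
    return None

def keep_plain_only(src_path, dst_path, marker=MARKER):
    """Copy src mbox to dst, reducing marked messages to their plain part.

    Returns the number of messages that were rewritten.
    """
    src = mailbox.mbox(src_path)
    dst = mailbox.mbox(dst_path)
    changed = 0
    try:
        for msg in src:
            boundary = msg.get_boundary() or ""
            plain = first_plain_part(msg) if marker in boundary else None
            if plain is not None:
                body = plain.get_payload()  # still transfer-encoded (base64)
                # Replace the multipart headers with the plain part's own.
                for name in ("Content-Type", "Content-Transfer-Encoding"):
                    del msg[name]
                    if plain[name] is not None:
                        msg[name] = plain[name]
                msg.set_payload(body)
                changed += 1
            dst.add(msg)
    finally:
        dst.close()
        src.close()
    return changed
```

Calling `keep_plain_only("in.mbox", "out.mbox")` (hypothetical file names) leaves all other messages untouched, so the output file can be imported in place of the original.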
Having done a bit more delving, it seems the Samsung mail app sends a plain text part and an HTML part, each encoded in base64. I’ve found that if I add a blank line between the two encoded parts, the mail filter works correctly. It may be that Samsung is not adding a blank line where it should, or that the mail filter isn’t correctly locating the plain text and HTML parts, and so doesn’t realise where the HTML part’s header finishes and the message content starts.
I’ve tried copying the original e-mail from Gmail (via view original) and also exporting the same message from Thunderbird, with the same results.
Samsung-generated e-mails seem to have this at the bottom of the headers:
[more base64 encoded data here]19fDQo=
----_com.samsung.android.email_396413402758380
Content-Transfer-Encoding: base64
Content-Type: text/html; charset=UTF-8
PGh0b[base64 encoding again, this time encoding HTML version of the same message]
Now if I change the middle bit by adding a blank line (after the “email_396413402758380” bit), all works perfectly!
[more base64 encoded data here]19fDQo=
----_com.samsung.android.email_396413402758380

Content-Transfer-Encoding: base64
Content-Type: text/html; charset=UTF-8
PGh0b[base64 encoding again, this time encoding HTML version of the same message]
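To see what that blank line might be doing, here is a small Python sketch that builds a synthetic two-part message with the same boundary and lists the part types a parser reports, with and without a blank line straight after the second boundary. It uses Python's standard `email` parser as a stand-in, not the Ruby mail gem Discourse uses, and the addresses, subject and body text are made up. In MIME a blank line ends a part's headers, so one possible reading (an assumption, not verified against Discourse) is that the edit hides the HTML part's headers and leaves only the plain text part to be used.

```python
import base64
from email import message_from_string

# Boundary parameter; each delimiter line on the wire is "--" + BOUNDARY,
# which gives the four leading dashes seen in the sample above.
BOUNDARY = "--_com.samsung.android.email_396413402758380"

def b64(text):
    """Base64-encode a short string as a single ASCII line."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def build_message(plain, html, blank_after_second_boundary=False):
    """Build a synthetic multipart/alternative message (made-up headers)."""
    lines = [
        "From: someone@example.com",
        "Subject: Re: test",
        "MIME-Version: 1.0",
        f'Content-Type: multipart/alternative; boundary="{BOUNDARY}"',
        "",
        f"--{BOUNDARY}",
        "Content-Transfer-Encoding: base64",
        "Content-Type: text/plain; charset=UTF-8",
        "",
        b64(plain),
        f"--{BOUNDARY}",
    ]
    if blank_after_second_boundary:
        lines.append("")  # the manual edit described in the post
    lines += [
        "Content-Transfer-Encoding: base64",
        "Content-Type: text/html; charset=UTF-8",
        "",
        b64(html),
        f"--{BOUNDARY}--",
        "",
    ]
    return "\n".join(lines)

def part_types(raw):
    """Content types of the leaf parts, as Python's parser sees them."""
    msg = message_from_string(raw)
    return [p.get_content_type() for p in msg.walk() if not p.is_multipart()]

normal = part_types(build_message("My reply", "<div>My reply</div>"))
edited = part_types(build_message("My reply", "<div>My reply</div>", True))
print(normal)
print(edited)
```

If the two lists differ, the blank line is changing how the second part is classified rather than repairing it, which would point at the HTML part's content as the real trigger.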
Well, in that case I’d say it’s either a bug in the mail gem we use for parsing emails or a bug in the Samsung app. After a quick glance at the RFCs I’d say it’s probably a bug in the parser.
Could you by any chance provide a full example of such a problematic email? Maybe you could ask one of the authors of your confidential emails to send you a non-confidential email?
I’ve tried to contrive an e-mail by decoding the base64, changing the wording, and re-encoding it, and have found something else interesting.
Removing a single space character part way through the original message can make the importer correctly extract the reply that was written above it.
In this example, in the middle of the base64-encoded HTML part there is a line containing a [space] before a closing div tag. If I remove it, changing
21 20:17 (GMT+00:00) </div><div>To: LIST@LISTS
to
21 20:17 (GMT+00:00)</div><div>To: LIST@LISTS
by removing the [space] character before the /div, then re-encode to base64 and put it back in the message testing box in the admin settings, then the filter works.
I could post an e-mail via direct message, if that would be any help?
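The decode, edit and re-encode round trip described above can be sketched in Python as follows. The function name is hypothetical, the part is assumed to be UTF-8 as its Content-Type header says, and only the first occurrence of the exact string quoted above is changed.

```python
import base64

def reencode_without_space(b64_html, width=76):
    """Decode a base64 HTML part, drop one space, and re-encode it."""
    # b64decode ignores the line breaks inside the encoded block.
    html = base64.b64decode(b64_html).decode("utf-8")
    # Remove only the space before the closing div quoted in the post.
    fixed = html.replace("(GMT+00:00) </div>", "(GMT+00:00)</div>", 1)
    encoded = base64.b64encode(fixed.encode("utf-8")).decode("ascii")
    # Re-wrap into lines of at most `width` characters, as mail clients do.
    return "\n".join(encoded[i:i + width] for i in range(0, len(encoded), width))
```

The returned text can be pasted over the original base64 block of the HTML part before putting the message back into the test box.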
Here is a contrived e-mail I’ve made that I think demonstrates the problem. If you look at the HTML part it has a reply to an earlier message. The importer doesn’t seem to be able to see where the original message started.
This problem seems to affect messages from other mail clients too, I’m now discovering. I can’t post the e-mails that generate the faults in public for everyone to look at, but would be happy to let someone see them privately.
My current setup is that I have installed Discourse on a home server, and e-mails to a listserv mailing list are sent to me at a Gmail account. If a ‘To:’ filter matches the name of the mailing list, Gmail forwards a copy of the e-mail to mailinglist@mydiscoursedomain.org.uk. Discourse has a category set to mirror a mailing list that looks for this e-mail address.
The same issue occurs if I use the import_mbox.sh script after manually copying the e-mails over, so it must be the part of the code that looks for the new part of the message that is getting confused.
Is there any way to make Discourse whizz through all the previously imported e-mails and reformat them using the plain text part of the original e-mails, in case that is a temporary fix for the above problem? Before the import it was set to use the HTML part. From peeking around using ‘rails c’ I can see each post seems to have the full text of the incoming message stored (including e-mail headers). I’ve tried running ‘rake posts:rebuild’ after turning off the HTML option. It plods through all the messages slowly, but I’m not sure if anything changed. For example, I tried turning the show trimmed content option on and off too, but the little box with three dots still seems to be there on posts after the rake has finished.