Punctuation on a line causing stripping in plaintext mail


(Greg) #1

I think we’ve hit an odd edge case in the inbound email parser. A colleague just sent a post by mail, and one line contained a long URL followed by a comma, e.g.:

As I've suggested in https://reallylong/thing/that/has/many/levels#and-a-bunch-of-params-are-here-too,
I will revert the changes in tasks that caused this troubles and release a new version.

However, his email client has tried to flow the paragraph, so what Discourse has actually received is this:

As I've suggested in
https://reallylong/thing/that/has/many/levels#and-a-bunch-of-params-are-here-too
,
I will revert the changes in tasks that caused this troubles and release a
new version

Note how the URL is on its own line, and the comma is shifted to a new line. This appears to cause the email parser a problem - the URL becomes a OneBox, and the comma and everything after it is dropped from the post. I’ve tested this by sending the exact same message again, but without the comma, and the second part of the paragraph is then included.

Is there some kind of “a line of all punctuation” regex that is scanning for signatures or something that might be causing this behaviour?
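
To illustrate what I suspect is happening, here is a minimal Python sketch. It assumes the trimmer treats any punctuation-only line as a delimiter and discards everything from that line onward. The pattern and the `trim_at_delimiter` helper are hypothetical and are not taken from Discourse's code.

```python
import re

# Hypothetical delimiter pattern: a line made up only of punctuation.
# This is a guess for illustration, not the pattern Discourse actually uses.
PUNCTUATION_ONLY = re.compile(r"^[,.\-_=*]+$")


def trim_at_delimiter(body: str) -> str:
    """Keep only the lines before the first punctuation-only line."""
    kept = []
    for line in body.splitlines():
        if PUNCTUATION_ONLY.match(line.strip()):
            break  # everything from the "delimiter" onward is dropped
        kept.append(line)
    return "\n".join(kept)


# The message as it arrived after the mail client reflowed it.
received = "\n".join([
    "As I've suggested in",
    "https://reallylong/thing/that/has/many/levels#and-a-bunch-of-params-are-here-too",
    ",",
    "I will revert the changes in tasks that caused this troubles and release a",
    "new version",
])

print(trim_at_delimiter(received))
```

With a pattern like this, the lone comma counts as a delimiter, so only the first two lines survive, which matches what we saw in the post.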


(Gerhard Schlager) #2

Parsing emails is not an easy task and, as you noticed yourself, email clients aren’t always helpful either…
In your case the text after the URL was probably trimmed. If you always want to see the trimmed text, you can enable the corresponding setting. Search for “always show trimmed content” in Admin -> Settings.

Also, when you are a staff user, you can click on the envelope icon in the top right corner of the post to see the original email.


(Greg) #3

Completely agree, it’s a pain. I was just wondering if this is a case of a regex that needs a tweak or something like that - matching a single comma seems overly harsh, so if it’s /^[,.-]*$/ or something like that, perhaps it could be tweaked to require at least two repetitions or similar. Do you happen to have a link to the code that handles trimming? I’d be willing to poke around…
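
As a rough sketch of the tweak I have in mind (in Python, just for illustration): both patterns below are my own guesses and are not copied from the trimming code. `LOOSE` accepts any run of punctuation, even a single character, while `STRICT` requires at least two. I have used `+` rather than `*` so that blank lines do not count as delimiters.

```python
import re

# Both patterns are hypothetical; neither is copied from the trimming code.
LOOSE = re.compile(r"^[,.\-]+$")      # any run of punctuation, even one character
STRICT = re.compile(r"^[,.\-]{2,}$")  # at least two punctuation characters


def is_delimiter(pattern: re.Pattern, line: str) -> bool:
    """Return True if the stripped line consists only of delimiter punctuation."""
    return pattern.match(line.strip()) is not None


for line in [",", "--", "-----", "new version"]:
    print(repr(line), is_delimiter(LOOSE, line), is_delimiter(STRICT, line))
```

Under these assumptions, a stray comma left behind by a reflowing mail client would no longer be treated as a delimiter, while lines such as `--` or `-----` still would be.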

I’m not sure I want to enable that. As the warning says, it can reveal email addresses, and anyway (in general) Discourse does a great job of removing unnecessary content, and I like that. It’s just in this edge case that we’ve lost some valid content, so I thought I’d report it first and see what the feeling is.


(Gerhard Schlager) #4

It’s not lost, it’s just hidden :wink:

You can take a look at GitHub - discourse/email_reply_trimmer: Library to trim replies from plain text email.
There’s lots of regex magic happening in there. Feel free to submit a PR that fixes this issue.


(Greg) #5

Well it’s lost to all except staff who can see the original email :slight_smile: - if an admin doesn’t notice / isn’t alerted by the user, it’ll never come back (and it’s already too late for the emails sent out by mailing list mode).

Thanks though, I think email_reply_trimmer/delimiter_matcher.rb at master · discourse/email_reply_trimmer · GitHub is what’s causing the issue (basically it is what I thought, a [punctuation]* regex). I’ll have a think about fixes :slight_smile: