URLs being dropped from Thunderbird-generated replies

To be make my previous question a bit more precise, is it a bug on Thunderbird’s part that it puts class="moz-txt-link-freetext into hyperlinks in emails that it sends? Is it a bug on Discourse’s part that the presence of such a tag causes the <a>...</a>-enclosed URL to be dropped from the body of the message and subsequent <br> tags to be ignored?

1 Like

Hi @Falco and @riking — Sorry to bug you, but since you’d looked at this issue previously: do you have any thoughts on my questions in the previous post about whether this is buggy behavior on the part of Thunderbird and/or Discourse? Thanks.

Hi @BradCray ,
I’ve experienced this same problem too when using import_mbox.sh on mailman mbox archives.

I tracked the problem down to extract_from_mozilla(doc) in lib/email/receiver.rb and made a simple fix which I’ve put in a PR https://github.com/discourse/discourse/pull/13176 .

The extract_from_mozilla method was making any tag with a class attribute starting with "moz-" hidden. My adjustment excludes from these hidden tags any of them with a class attribute starting with ^moz-txt-link\b which I think covers Thunderbird’s links (based on the large mbox I was working with).

I haven’t had time to write tests so the PR might not get accepted but feel free to use it (although be aware it’ll probably be wiped out by any upgrade unless the PR is accepted).

I’ve only tested on imported mboxes but I think it should work on newly received email.
Unfortunately I don’t think it will apply during a rebake (it didn’t for me) so might be pretty hard to apply retrospectively.

4 Likes

@BradCray, for info the PR won’t be accepted without tests and it’ll be some time before I get a change to look into doing that.

Thanks for the attempt at fixing this and update, @bsoares! Maybe your PR will inspire someone else who’s already familiar with discourse testing to pick up the testing aspect if you can’t get to it sooner (?). I tend to be a little surprised that this issue hasn’t generated more interest or concern, as my impression is that Thunderbird is a reasonably popular email client (though admittedly, I don’t use it myself).

Thanks again,
-Brad

Thanks for the comments @BradCray . Hopefully I’ll find time – work has said I can spend some time on it but it probably won’t be for a couple of weeks at least. I’m also surprised there isn’t more mention of this, as you say Thunderbird is a popular email client. We’re just switching to discourse and I’ve been playing with importing mailman mboxes, my colleague noticed the problem more or less straight away! Took a bit to diagnose because there’s nothing wrong with the Thunderbird emails, just an overzealous sig/reply hider in lib/email/receiver.rb!

1 Like

@BradCray (and anyone else after a fix for this), I’m not going to have time to get a proper PR with tests any time soon, so what I have put in place for out discourse is a patch which should continue to work so long as the method in question doesn’t get changed (in which case I hope it gets fixed!).

Put the attached lib_email_receiver_rb-thunderbird_links.patch in, say /var/discourse/shared/standalone/patches.txt [lib_email_receiver_rb-thunderbird_links.patch.txt|attachment](upload://1OvwVDGUCO2Y3VtkGZnt6hCjAZk.txt) (819 Bytes) (you might have to

mkdir -p  /var/discourse/shared/standalone/patches

)
and then put the following in containers/app.yml in the hooks: after_code: section (after the plugins if you have any):

hooks:
  after_code:
    # plugins "-exec" here
    - exec:
        cd: $home
        cmd:
          - git apply /shared/patches/lib_email_receiver_rb-thunderbird_links.patch.txt

This assumes you have /var/discourse/shared/standalone mapped to /shared in the volumes: section, so adjust to suit your environment.

You’ll need to rebuild the container with ./launcher rebuild app but you should retain the fix whenever you rebuild. Not sure if a web-admin induced update would apply it though.

[The attachment has extension .txt so it can be attached to this comment, you may want to get rid of it in the filename and the config]

1 Like

I have the problem too. Therefore I would like to try the solution/workaround with the patch file. Unfortunately, the attachment lib_email_receiver_rb-thunderbird_links.patch was not uploaded correctly. Could you please resend it? Maybe simply as a code block?

4 Likes

Hi, I’ve attached the file again (with a ‘.txt’ added for upload reasons), and here’s the content anyway:

diff --git a/lib/email/receiver.rb b/lib/email/receiver.rb
index a6da67cbaa..38c8439f38 100644
--- a/lib/email/receiver.rb
+++ b/lib/email/receiver.rb
@@ -480,8 +480,12 @@ module Email
 
     def extract_from_mozilla(doc)
       # Mozilla (Thunderbird ?) properly identifies signature and forwarded emails
-      # Remove them and anything that comes after
-      elided = doc.css("*[class^='moz-'], *[class^='moz-'] ~ *").remove
+      # Remove them and anything that comes after. Leave in moz-txt-link- classes as they're links.
+      elided = doc.css("*[@class]:mozfilter", Class.new {
+        def mozfilter node_set
+          node_set.find_all { |node| node["class"] =~ /^moz-/ and node["class"] !~ /^moz-txt-link\b/ }
+        end
+      }.new).remove
       to_markdown(doc.to_html, elided.to_html)
     end

lib_email_receiver_rb-thunderbird_links.patch.txt (819 Bytes)

To fix a typo I’ve just spotted in my earlier post, to use this patch save this patch file in ~/lib_email_receiver_rb-thunderbird_links.patch.txt and then

mkdir -p /var/discourse/shared/standalone/patches
cp ~/lib_email_receiver_rb-thunderbird_links.patch.txt /var/discourse/shared/standalone/patches/

and then edit containers/app.yml as described above.

3 Likes

Thanks, works like a charm, except for the case of links with alias (like this one for example), which get separated …

1 Like

@Flominator Glad it’s at least partly working! I’d be keen to adapt the mozfilter part to not discard a link with alias. I hadn’t come across one like this, and haven’t been able to find documentation on thunderbird’s link classes. It’s not overly fancy though so it might not cope!
Would you be able to post or message me the HTML source in the email that created the post to have a look at? Thanks!

1 Like

It’s something like this:

    <p>some text, <a moz-do-not-send="true"
        href="https://ro.wikipedia.org/wiki/Crai_Nou,_Timi%C8%99">Links
        with alias</a> more text</p>
1 Like

Hmmm, in that example, and in the full email source you messaged me (thank you!) the moz-do-not-send attribute should not affect the display of the <img> or <a> tags that that attribute appears in. The mozfilter is only looking at values in the class attribute, and I can’t immediately see anywhere else that it might get filtered.

As such I can’t work out why the link-with-alias’s content and href get separated out, unless for some reason the discourse importer is suddenly deciding to use the plain/text part of the Mime encoded email (which does have the text and URL separated). Why it would do that I don’t know.

In your test discourse setup can you try importing/emailing a thunderbird HTML email with both a link-with-alias and, say an embedded image or something else that will mark it out as the HTML part of the email?

1 Like

Thanks for trying.

Then the picture which in the mail in between the text (left) turns to be at the end in Discourse (right):

1 Like

I’m still thinking this might be discourse using the other mime parts (in this case a plain/text part followed by an image/… part of the email to create its markdown version, though why I don’t know. Perhaps an HTML validator is rejecting the text/html part because of strictly-non-validating attributes like moz-do-not-send!?
Could you do one more test, with the same post (some text with an image in the middle, but also make some (but not all) of the text bold, even just one work. I think that will determine if the text part is coming from a text/plain or text/html block.

And sorry to ask, but just to make sure, you have incoming_email_prefer_html set to true (checked)?!

1 Like

The email:
grafik

The post:

2 Likes

Thanks @Flominator . I see that the emboldened text has become italicised, so it’s definitely not using the HTML directly, but it is picking up on the emphasis somehow. I wonder if the text/plain part of the email gets some kind of markup/down added – would you be able to PM me the email source like last time?

1 Like

Hi @Flominator , thanks for the raw email. Looking at the text/plain alternative part of the email does indeed put asterisks around the text that’s in bold in the text/html part. Most markdown renderers (such as the one in discourse) interpret this as italicised. Here’s the text/plain segment copied and pasted on its own:

Und nochmal soll ich für hier
https://meta.discourse.org/t/urls-being-dropped-from-thunderbird-generated-replies/163751/24
eine Testnachricht schicken.

Bild in die Mitte dann wieder Text von dem ein Teil sogar fett
geschrieben
ist

Gruß

Flo

which looks identical to your screenshot.

So what I think is happening is that the text/html segment is being rejected as invalid HTML (probably down to the non-standard moz-do-not-send attribute name in the a tags). This will require the patch to change how valid HTML is checked (possibly just removing those attributes) and I’m less confident how stable that will be without it going into the core code. I’ll have a look when I get some time.

4 Likes

Hi all following this topic,

I’ve just spotted (before updating) that a separate fix (along the same lines but more specific) for this issue has been committed:
issue: FIX: properly clean Thunderbird emails, don't remove links by ValdikSS · Pull Request #16543 · discourse/discourse · GitHub
commit: FIX: properly clean Thunderbird emails, don't remove links (#16543) · discourse/discourse@f7540aa · GitHub

This means that the patch attached in an above comment will fail (probably not “spectacularly” but it might then require a rebuild to get upgrades going again) when you upgrade to include this new commit (probably your next upgrade).
If you have it automatically being applied (e.g. with a git apply cmd in your app.yml as described above), you should remove that before your next upgrade. In fact a rebuild might be in order as that commit might fail to apply since the place in receiver.rb where it will want to apply the commit diff has already been changed by the patch.

I’m going to 1) remove the git apply cmd from app.yml, 2) rebuild app, 3) update (if it hasn’t already in the rebuild). I’ll let you know how that goes…

[10 minutes later…]

In the end I did the following instead because it doesn’t require any downtime during the rebuild.

  1. remove the git apply for the patch from app.yml (only needs to be done before your next app container rebuild)
  2. revert the patched file with:
    i) launcher enter app
    ii) (now in app container)
    cd /var/www/discourse
    git checkout ./lib/email/receiver.rb
    exit
  1. update discourse using the web admin update
1 Like

I’m the author of this patch. It works great for me, and I found no drawbacks.

2 Likes