jmana9
August 17, 2018, 4:36am
1
On the site I am test driving with a few willing participants (v2.1.0.beta3 +136), someone created a new topic via email using Thunderbird with three inline images. Something got messed up, as this is what came through where the images should have been:
<img moz-do-not-send="false" src="/uploads/default/original/1X/1c7fd6675d43e41fd6451802c7595ece0d9cea4d.jpg" alt="WE 2500 -
transparent" height="300" width="400">
<img moz-do-not-send="false" src="/uploads/default/original/1X/e40f133566b4b9e17da781cd78f0a22657a19e87.jpg" alt="WE 2500
transparent - back" height="300" width="400">
<img moz-do-not-send="false" src="/uploads/default/original/1X/97df6e399840418821cbfbdabcb6af0de5d7cfd0.jpg" alt="WE 2500
transparent" height="300" width="400">
The image that the user sent inline is in fact stored on my server at the aforementioned src location, so Discourse is apparently able to extract the image from the email message. However, instead of putting the image into the post, it just replaces the src tag from the email with the location where the image is stored. See the same snippet from the original email message:
<img moz-do-not-send="false"
src="cid:part1.3D21F859.C11923DA@verizon.net" alt="WE 2500 -
transparent" class="" height="300" width="400"><br>
<br>
<img moz-do-not-send="false"
src="cid:part2.0838A5E5.943C8815@verizon.net" alt="WE 2500
transparent - back" class="" height="300" width="400"><br>
<br>
<img moz-do-not-send="false"
src="cid:part3.E9D6AC1E.7457D6BA@verizon.net" alt="WE 2500
transparent" class="" height="300" width="400"><br>
Iām able to post inline images from Gmail, so this seems like a Discourse/Thunderbird interaction issue, and not an email inline image problem in general.
If anyone would like to see the full email message, Iād gladly send it over privately. The pictures came through fine when he forwarded it to my Gmail account.
sam
(Sam Saffron)
August 17, 2018, 4:50am
2
Can you reply via email to this message with an example failure?
1 Like
jmana9
August 17, 2018, 5:28am
3
I just asked the user to tell me what steps he took to produce the failed message. I will try to reproduce it and then I will post back.
2 Likes
LeoMcA
(Leo McArdle)
August 17, 2018, 9:24am
4
I canāt see why this wouldnāt produce an image, <img> is valid markdown (at least on Discourse):
<img src='//assets-meta-cdck-prod-meta.s3.dualstack.us-west-1.amazonaws.com/original/3X/9/e/9e8c4ec0ee685ff6e7721b2da38a016c996c489c.png'>
Or do you mean the the tags literally came through like that? As in, the tags were rendered within a preformatted text block?
2 Likes
jmana9
August 17, 2018, 1:54pm
5
The post literally came through like that:
1 Like
LeoMcA
(Leo McArdle)
August 17, 2018, 2:19pm
6
A-ha, Iāve seen this a few times with Thunderbird (but never with images, just with text so it didnāt really impact the usability/readability of the email).
I imagine whatās going on here is that somewhere in the HTML ā Markdown conversion some whitespace is being added (or not removed) producing something like this:
<img blah blah blah blah>
which is rendered like so:
<img blah blah blah blah>
Getting the raw markdown output of the post will confirm this. Do that by replacing the /t/<slug>/ part of the url with /raw/, eg:
https://meta.discourse.org/t/inline-image-post-via-thunderbird-not-processed-properly/94961/5 ā https://meta.discourse.org/raw/94961/5
3 Likes
jmana9
August 17, 2018, 2:23pm
7
Hereās what I see:
Photos of a 2500 set with transparent housing.
<img moz-do-not-send="false" src="/uploads/default/original/1X/1c7fd6675d43e41fd6451802c7595ece0d9cea4d.jpg" alt="WE 2500 -
transparent" height="300" width="400">
<img moz-do-not-send="false" src="/uploads/default/original/1X/e40f133566b4b9e17da781cd78f0a22657a19e87.jpg" alt="WE 2500
transparent - back" height="300" width="400">
<img moz-do-not-send="false" src="/uploads/default/original/1X/97df6e399840418821cbfbdabcb6af0de5d7cfd0.jpg" alt="WE 2500
transparent" height="300" width="400">
And hereās a picture just in case that doesnāt format properly:
LeoMcA
(Leo McArdle)
August 17, 2018, 2:30pm
8
Yeah, as I thought, itās whitespace causing it to be rendered as a preformatted block.
Easiest way to debug this is with the raw email, but itāll contain private information, so I suggest you PM it to @zogstrip directly, rather than post it out in the open here.
You can extract the raw email by clicking the icon next the topic date in the top right, and copying and pasting the contents of the raw tab into a text file. Then upload that text file in a PM to @zogstrip .
@zogstrip I hope thatās alright with you (it has been whenever Iāve PMed you my dodgily formatted emails!)
4 Likes
jmana9
August 17, 2018, 2:37pm
9
Oh cool, I didnāt realize you could extract the raw email that way.
PM sent. Thanks for looking into this, Leo.
5 Likes
Better late than never
This will be fixed once this PR is merged
master ā fix-html-to-markdown
merged 10:21AM - 30 Apr 20 UTC
TLDR; this commit vastly improves how whitespaces are handled when converting fr⦠om HTML to Markdown.
It also adds support for converting HTML `<tables>` to markdown tables.
---
The previous `remove_whitespaces!` method was traversing the whole HTML tree and used a heuristic to remove
leading and trailing whitespaces whenever it was appropriate (ie. mostly before and after HTML block elements)
It was a good idea, but it was very limited and leaded to bad conversion when the html had leading whitespaces on several lines for example.
One such example can be found [here](https://meta.discourse.org/t/86782).
For various reasons, most of the whitespaces in a HTML file is ignored when the page is being displayed in a browser.
The rules that the browsers follow are the [CSS' White Space Processing Rules](https://www.w3.org/TR/css-text-3/#white-space-rules).
They can be quite complicated when you take into account RTL languages and other various tidbits but they boils down to the following:
- Collapse whitespaces down to one space (0x20) inside an inline context (ie. nodes/tags that are being displaying on the same line)
- Remove any leading/trailing whitespaces inside an inline context
One quick & dirty way of getting this 90% solved would be to do `HTML.gsub!(/[[:space:]]+/, " ")`.
We would also need to hoist `<pre>` elements in order to not mess with their whitespaces.
Unfortunately, this solution let some whitespaces creep around HTML tags which leads to more `.strip!` calls than I can bear.
I decided to "_emulate_" the browser's handling of whitespaces and came up with a solution in 4 parts
#### 1. `remove_not_allowed!`
The HtmlToMarkdown library is recursively "visiting" all the nodes in the HTML in order to convert them to Markdown.
All the nodes that aren't handled by the library (eg. `<script>`, `<style>` or any non-textual HTML tags) are "swallowed".
In order to reduce the number of nodes visited, the method `remove_not_allowed!` will automatically delete all the nodes
that have no "visitor" (eg. a `visit_<tag>` method) defined.
#### 2. `remove_hidden!`
Similar purpose as the previous method (eg. reducing number of nodes visited), there's no point trying to convert something that is hidden.
The `remove_hidden!` method removes any nodes that was hidden using the "hidden" HTML attribute, some CSS or with a width or height equal to 0.
#### 3. `hoist_line_breaks!`
The `hoist_line_breaks!` method is there to handle `<br>` tags. I know those tiny `<br>` don't do much but they can be quite annoying.
The `<br>` tags are inline elements but they visually work like a block element (ie. they create a new line).
If you have the following HTML "`<i>Foo<br>Bar</i>`", it ends up visually similar to "`<i>Foo</i><br><i>Bar</i>`".
The latter being much more easy to process than the former, so that's what this method is doing.
The `hoist_line_breaks` will hoist `<br>` tags out of inline tags until their parent is a block element.
#### 4. `remove_whitespaces!`
The `remove_whitespaces!` is where all the whitespace removal is happening. It's broken down into 4 methods as well
- `remove_whitespaces!`
- `is_inline?`
- `collapse_spaces!`
- `remove_trailing_space!`
The `remove_whitespace!` method is recursively walking the HTML tree (skipping `<pre>` tags).
If a node has any children, they will be chunked into groups of inline elements vs block elements.
For each chunks of inline elements, it will call the `collapse_space!` and `remove_trailing_space!` methods.
For each chunks of block elements, it will call `remote_whitespace!` to keep walking the HTML tree recursively.
The `is_inline?` method determines whether a node is part of a inline context.
A node is inline iif it's a text node or it's an inline tag, but not `<br>`, and all its children are also inline.
The `collapse_spaces!` method will collapse any kind of (white) space into a single space (" ") character, even across tags.
For example, if we have "` Foo \n<i> Bar </i>\t42`", it will return "`Foo <i>Bar </i>42`".
Finally, the `remove_trailing_space!` method is there to remove any trailing space that might creep in at the end of the inline chunk.
This solution is not 100% bullet-proof.
It does not support RTL languages at all and has some caveats that I felt were not worth the work to get properly fixed.
---
FIX: switched Nokogiri to Nokogumbo for better HTML5 parsing
FIX: better detection of hidden elements when converting HTML to Markdown
FIX: take into account the `allowed_href_schemes` site setting when converting HTML `<a>` to Markdown
FIX: added support for 'mailto:' scheme when converting `<a>` from HTML to Markdown
FIX: added support for `<img>` dimensions when converting from HTML to Markdown
FIX: added support for `<dl>`, `<dd>` and `<dt>` when converting from HTML to Markdown
FIX: added support for multilines emphases, strongs and strikes when converting from HTML to Markdown
FIX: added support for `<acronym>` when converting from HTML to Markdown
DEV: remove unused 'sanitize' gem
Wow, did you just read all that?! Congratz, here's a cookie: šŖ.
4 Likes