jmana9
September 19, 2018, 2:48pm
1
I have a user who sends email through the built-in Seamonkey email client. When he emailed a message to my forum, the contents after the first line were indented and put into a quote block:
The message looks normal, with no indentation, in the HTML view:
When he tried to reply inline, things didn’t work well there either:
And here’s the HTML view:
If anyone would like to look into this, I will gladly PM the contents of the raw emails. This user is one of the most active users on our Yahoo Group who wants to continue to use email when we switch to Discourse, so I’m highly motivated to improve things for him - but I realize that probably doesn’t/shouldn’t translate to motivation on the development end!
1 Like
LeoMcA
(Leo McArdle)
September 19, 2018, 2:55pm
2
I imagine this is the same whitespace issue as caused by Thunderbird:
I imagine what’s going on here is that somewhere in the HTML → Markdown conversion some whitespace is being added (or not removed) producing something like this:
<img blah blah blah blah>
which is rendered like so:
<img blah blah blah blah>
Inline replies are pretty hard to process, see here:
Recently, someone replied to one of my posts on our site, simply saying:
Thanks, Dave. My comments are inline
But that was it. On closer inspection, I noticed that the post was a reply by email. I clicked on the little email icon and saw the rest of what they had said, sprinkled in like:
Thanks, Dave. My comments are inline
On Thu, Jun 21, 2018 at 2:45 PM David McClure <ourforum@discoursemail.com>
wrote:
> dmcclure <https://www.ourforum.com/u/dmcclure> David McClure
> <https://www.ourforu…
2 Likes
jmana9
September 19, 2018, 3:19pm
3
Yeah, I had a feeling it might be. Do you know if the whitespace problem is going to be worked on?
It seems so, but since so many of our users frequently reply inline, and enough of them want to continue using email that I’m starting to worry that Discourse might not be a good fit for us if this can’t be improved. Our users have come to hate Yahoo Groups, but I can hear the comments now, “well if awful Yahoo can parse my emails properly, why can’t this new thing?” I wish we could get people to change their habits, but it’s a losing battle.
I really like Discourse, but if our users that prefer email stop posting because their messages don’t come through properly, that would be a big loss for our community.
It’s high on my todo list. I’ll work on it early October.
4 Likes
jmana9
September 19, 2018, 7:48pm
6
Awesome, thanks for the update Régis! Let me know if you need help testing anything or if you need copies of troublesome emails.
Do you think improving whitespace parsing is likely to improve the handling of inline replies as well?
jmana9
October 30, 2018, 12:32am
9
Has there been any progress on this?
sam
(Sam Saffron)
October 30, 2018, 5:15am
10
It is still on our radar as a high priority item… work has been delayed a bit due to other urgent work.
2 Likes
jmana9
October 30, 2018, 11:41am
11
Thanks for the update Sam
Finally got to it
This will be fixed once this PR is merged
master ← fix-html-to-markdown
merged 10:21AM - 30 Apr 20 UTC
TLDR; this commit vastly improves how whitespaces are handled when converting fr… om HTML to Markdown.
It also adds support for converting HTML `<tables>` to markdown tables.
---
The previous `remove_whitespaces!` method was traversing the whole HTML tree and used a heuristic to remove
leading and trailing whitespaces whenever it was appropriate (ie. mostly before and after HTML block elements)
It was a good idea, but it was very limited and leaded to bad conversion when the html had leading whitespaces on several lines for example.
One such example can be found [here](https://meta.discourse.org/t/86782).
For various reasons, most of the whitespaces in a HTML file is ignored when the page is being displayed in a browser.
The rules that the browsers follow are the [CSS' White Space Processing Rules](https://www.w3.org/TR/css-text-3/#white-space-rules).
They can be quite complicated when you take into account RTL languages and other various tidbits but they boils down to the following:
- Collapse whitespaces down to one space (0x20) inside an inline context (ie. nodes/tags that are being displaying on the same line)
- Remove any leading/trailing whitespaces inside an inline context
One quick & dirty way of getting this 90% solved would be to do `HTML.gsub!(/[[:space:]]+/, " ")`.
We would also need to hoist `<pre>` elements in order to not mess with their whitespaces.
Unfortunately, this solution let some whitespaces creep around HTML tags which leads to more `.strip!` calls than I can bear.
I decided to "_emulate_" the browser's handling of whitespaces and came up with a solution in 4 parts
#### 1. `remove_not_allowed!`
The HtmlToMarkdown library is recursively "visiting" all the nodes in the HTML in order to convert them to Markdown.
All the nodes that aren't handled by the library (eg. `<script>`, `<style>` or any non-textual HTML tags) are "swallowed".
In order to reduce the number of nodes visited, the method `remove_not_allowed!` will automatically delete all the nodes
that have no "visitor" (eg. a `visit_<tag>` method) defined.
#### 2. `remove_hidden!`
Similar purpose as the previous method (eg. reducing number of nodes visited), there's no point trying to convert something that is hidden.
The `remove_hidden!` method removes any nodes that was hidden using the "hidden" HTML attribute, some CSS or with a width or height equal to 0.
#### 3. `hoist_line_breaks!`
The `hoist_line_breaks!` method is there to handle `<br>` tags. I know those tiny `<br>` don't do much but they can be quite annoying.
The `<br>` tags are inline elements but they visually work like a block element (ie. they create a new line).
If you have the following HTML "`<i>Foo<br>Bar</i>`", it ends up visually similar to "`<i>Foo</i><br><i>Bar</i>`".
The latter being much more easy to process than the former, so that's what this method is doing.
The `hoist_line_breaks` will hoist `<br>` tags out of inline tags until their parent is a block element.
#### 4. `remove_whitespaces!`
The `remove_whitespaces!` is where all the whitespace removal is happening. It's broken down into 4 methods as well
- `remove_whitespaces!`
- `is_inline?`
- `collapse_spaces!`
- `remove_trailing_space!`
The `remove_whitespace!` method is recursively walking the HTML tree (skipping `<pre>` tags).
If a node has any children, they will be chunked into groups of inline elements vs block elements.
For each chunks of inline elements, it will call the `collapse_space!` and `remove_trailing_space!` methods.
For each chunks of block elements, it will call `remote_whitespace!` to keep walking the HTML tree recursively.
The `is_inline?` method determines whether a node is part of a inline context.
A node is inline iif it's a text node or it's an inline tag, but not `<br>`, and all its children are also inline.
The `collapse_spaces!` method will collapse any kind of (white) space into a single space (" ") character, even across tags.
For example, if we have "` Foo \n<i> Bar </i>\t42`", it will return "`Foo <i>Bar </i>42`".
Finally, the `remove_trailing_space!` method is there to remove any trailing space that might creep in at the end of the inline chunk.
This solution is not 100% bullet-proof.
It does not support RTL languages at all and has some caveats that I felt were not worth the work to get properly fixed.
---
FIX: switched Nokogiri to Nokogumbo for better HTML5 parsing
FIX: better detection of hidden elements when converting HTML to Markdown
FIX: take into account the `allowed_href_schemes` site setting when converting HTML `<a>` to Markdown
FIX: added support for 'mailto:' scheme when converting `<a>` from HTML to Markdown
FIX: added support for `<img>` dimensions when converting from HTML to Markdown
FIX: added support for `<dl>`, `<dd>` and `<dt>` when converting from HTML to Markdown
FIX: added support for multilines emphases, strongs and strikes when converting from HTML to Markdown
FIX: added support for `<acronym>` when converting from HTML to Markdown
DEV: remove unused 'sanitize' gem
Wow, did you just read all that?! Congratz, here's a cookie: 🍪.
2 Likes