HTML email conversion drops tables and everything trailing it

brechtm · February 10, 2020, 3:41pm

With incoming email prefer html enabled, HTML to markdown conversion seems to abort when encountering an HTML table. The table doesn’t even get transformed to plain text, as mentioned in Maintain tables from HTML email in when using HTML to Markdown. It seems the allow html tables setting has disappeared since then, so I’m assuming this is now always enabled?

Anything trailing the tables is missing from the end result too, badly crippling emails containing tables.

I previously reported this in the aforementioned topic, but I’m now creating this new topic in the bug category to make sure this bug is recorded properly.

Note that pasting the HTML tables into a post works fine, suggesting that there’s no reason HTML tables couldn’t simply be included as-is into the Markdown post.

jomaxro · February 11, 2020, 5:47pm

Thanks for the report @brechtm. Funny, I just reported this myself the other day. It’s on our radar for a fix.

zogstrip · April 29, 2020, 3:37pm

The reason was that the code handling HTML to Markdown conversion on the server-side did not support <table>

This will be fixed once this PR is merged

github.com/discourse/discourse

FIX: server-side HtmlToMarkdown improvements

discourse:master ← discourse:fix-html-to-markdown

opened 03:30PM - 29 Apr 20 UTC

ZogStriP

+422 -207

TLDR; this commit vastly improves how whitespaces are handled when converting fr…om HTML to Markdown. It also adds support for converting HTML `<tables>` to markdown tables. --- The previous `remove_whitespaces!` method was traversing the whole HTML tree and used a heuristic to remove leading and trailing whitespaces whenever it was appropriate (ie. mostly before and after HTML block elements) It was a good idea, but it was very limited and leaded to bad conversion when the html had leading whitespaces on several lines for example. One such example can be found [here](https://meta.discourse.org/t/86782). For various reasons, most of the whitespaces in a HTML file is ignored when the page is being displayed in a browser. The rules that the browsers follow are the [CSS' White Space Processing Rules](https://www.w3.org/TR/css-text-3/#white-space-rules). They can be quite complicated when you take into account RTL languages and other various tidbits but they boils down to the following: - Collapse whitespaces down to one space (0x20) inside an inline context (ie. nodes/tags that are being displaying on the same line) - Remove any leading/trailing whitespaces inside an inline context One quick & dirty way of getting this 90% solved would be to do `HTML.gsub!(/[[:space:]]+/, " ")`. We would also need to hoist `<pre>` elements in order to not mess with their whitespaces. Unfortunately, this solution let some whitespaces creep around HTML tags which leads to more `.strip!` calls than I can bear. I decided to "_emulate_" the browser's handling of whitespaces and came up with a solution in 4 parts #### 1. `remove_not_allowed!` The HtmlToMarkdown library is recursively "visiting" all the nodes in the HTML in order to convert them to Markdown. All the nodes that aren't handled by the library (eg. `<script>`, `<style>` or any non-textual HTML tags) are "swallowed". In order to reduce the number of nodes visited, the method `remove_not_allowed!` will automatically delete all the nodes that have no "visitor" (eg. a `visit_<tag>` method) defined. #### 2. `remove_hidden!` Similar purpose as the previous method (eg. reducing number of nodes visited), there's no point trying to convert something that is hidden. The `remove_hidden!` method removes any nodes that was hidden using the "hidden" HTML attribute, some CSS or with a width or height equal to 0. #### 3. `hoist_line_breaks!` The `hoist_line_breaks!` method is there to handle ` ` tags. I know those tiny ` ` don't do much but they can be quite annoying. The ` ` tags are inline elements but they visually work like a block element (ie. they create a new line). If you have the following HTML "`Foo Bar`", it ends up visually similar to "`Foo Bar`". The latter being much more easy to process than the former, so that's what this method is doing. The `hoist_line_breaks` will hoist ` ` tags out of inline tags until their parent is a block element. #### 4. `remove_whitespaces!` The `remove_whitespaces!` is where all the whitespace removal is happening. It's broken down into 4 methods as well - `remove_whitespaces!` - `is_inline?` - `collapse_spaces!` - `remove_trailing_space!` The `remove_whitespace!` method is recursively walking the HTML tree (skipping `<pre>` tags). If a node has any children, they will be chunked into groups of inline elements vs block elements. For each chunks of inline elements, it will call the `collapse_space!` and `remove_trailing_space!` methods. For each chunks of block elements, it will call `remote_whitespace!` to keep walking the HTML tree recursively. The `is_inline?` method determines whether a node is part of a inline context. A node is inline iif it's a text node or it's an inline tag, but not ` `, and all its children are also inline. The `collapse_spaces!` method will collapse any kind of (white) space into a single space (" ") character, even across tags. For example, if we have "` Foo \n Bar \t42`", it will return "`Foo Bar 42`". Finally, the `remove_trailing_space!` method is there to remove any trailing space that might creep in at the end of the inline chunk. This solution is not 100% bullet-proof. It does not support RTL languages at all and has some caveats that I felt were not worth the work to get properly fixed. --- FIX: switched Nokogiri to Nokogumbo for better HTML5 parsing FIX: better detection of hidden elements when converting HTML to Markdown FIX: take into account the `allowed_href_schemes` site setting when converting HTML `<a>` to Markdown FIX: added support for 'mailto:' scheme when converting `<a>` from HTML to Markdown FIX: added support for `<img>` dimensions when converting from HTML to Markdown FIX: added support for `<dl>`, `<dd>` and `<dt>` when converting from HTML to Markdown FIX: added support for multilines emphases, strongs and strikes when converting from HTML to Markdown FIX: added support for `<acronym>` when converting from HTML to Markdown DEV: remove unused 'sanitize' gem Wow, did you just read all that?! Congratz, here's a cookie: 🍪.

Topic		Replies	Views
Maintain tables from HTML email in when using HTML to Markdown Feature	6	1321	April 30, 2020
Disable indent code block for emails Feature	21	1805	April 30, 2020
Change <br> into \r\n in incoming emails Support	0	256	July 14, 2022
Markdown isn't processed inside tables UX markdown-it-review	3	4046	June 26, 2017
Email-in containing <br></b> Bug	3	864	December 18, 2017

HTML email conversion drops tables and everything trailing it

Related topics