Disable code block indentation in emails

zogstrip · April 29, 2020, 3:35 PM

The future is looking brighter now

This will be fixed once this PR is merged

github.com/discourse/discourse

FIX: server-side HtmlToMarkdown improvements (#9586)

master ← fix-html-to-markdown

merged 10:21AM - 30 Apr 20 UTC

ZogStriP

+422 -207

TLDR; this commit vastly improves how whitespace is handled when converting from HTML to Markdown. It also adds support for converting HTML `<table>` elements to Markdown tables.

---

The previous `remove_whitespaces!` method traversed the whole HTML tree and used a heuristic to remove leading and trailing whitespace whenever it was appropriate (ie. mostly before and after HTML block elements).

It was a good idea, but it was very limited and led to bad conversions when, for example, the HTML had leading whitespace on several lines. One such example can be found [here](https://meta.discourse.org/t/86782).

For various reasons, most of the whitespace in an HTML file is ignored when the page is displayed in a browser. The rules that browsers follow are the [CSS White Space Processing Rules](https://www.w3.org/TR/css-text-3/#white-space-rules).

They can be quite complicated when you take into account RTL languages and various other tidbits, but they boil down to the following:

- Collapse whitespace down to one space (0x20) inside an inline context (ie. nodes/tags that are displayed on the same line)
- Remove any leading/trailing whitespace inside an inline context

One quick & dirty way of getting this 90% solved would be to do `HTML.gsub!(/[[:space:]]+/, " ")`. We would also need to hoist `<pre>` elements in order to not mess with their whitespace.

Unfortunately, this solution lets some whitespace creep around HTML tags, which leads to more `.strip!` calls than I can bear.

I decided to "_emulate_" the browser's handling of whitespace and came up with a solution in 4 parts

#### 1. `remove_not_allowed!`

The HtmlToMarkdown library recursively "visits" all the nodes in the HTML in order to convert them to Markdown. All the nodes that aren't handled by the library (eg. `<script>`, `<style>` or any non-textual HTML tags) are "swallowed".
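A minimal sketch of that visiting-and-swallowing behaviour, written in plain Ruby. The `Node` struct and the `TinyVisitor` class are hypothetical stand-ins invented for illustration; they are not the library's actual classes, and the real code works on a parsed HTML document rather than on hand-built structs.

```ruby
# Hypothetical stand-in for a parsed HTML node (not the library's real class).
Node = Struct.new(:name, :text, :children)

class TinyVisitor
  def convert(node)
    visit(node).strip
  end

  private

  # Dispatch to visit_<tag> when such a method exists; otherwise the node
  # (and everything below it) is swallowed and contributes nothing.
  def visit(node)
    handler = "visit_#{node.name}"
    respond_to?(handler, true) ? send(handler, node) : ""
  end

  # Convert every child and concatenate the results.
  def traverse(node)
    (node.children || []).map { |child| visit(child) }.join
  end

  def visit_text(node)
    node.text
  end

  def visit_div(node)
    traverse(node)
  end

  def visit_b(node)
    "**#{traverse(node)}**"
  end
end

tree = Node.new("div", nil, [
  Node.new("text", "Hello "),
  Node.new("b", nil, [Node.new("text", "world")]),
  Node.new("script", nil, [Node.new("text", "alert(1)")]),
])

puts TinyVisitor.new.convert(tree)
```

Because no `visit_script` method is defined in this sketch, the `script` node produces no output at all, while the `b` node is rendered as bold text.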
In order to reduce the number of nodes visited, the `remove_not_allowed!` method automatically deletes all the nodes that have no "visitor" (eg. a `visit_<tag>` method) defined.

#### 2. `remove_hidden!`

Similar purpose as the previous method (ie. reducing the number of nodes visited): there's no point trying to convert something that is hidden. The `remove_hidden!` method removes any node that was hidden using the "hidden" HTML attribute, some CSS, or a width or height equal to 0.

#### 3. `hoist_line_breaks!`

The `hoist_line_breaks!` method is there to handle `<br>` tags. I know those tiny `<br>` don't do much but they can be quite annoying. The `<br>` tags are inline elements but they visually work like a block element (ie. they create a new line). If you have the following HTML "`<i>Foo<br>Bar</i>`", it ends up visually similar to "`<i>Foo</i><br><i>Bar</i>`". The latter is much easier to process than the former, so that's what this method does.

The `hoist_line_breaks!` method hoists `<br>` tags out of inline tags until their parent is a block element.

#### 4. `remove_whitespaces!`

`remove_whitespaces!` is where all the whitespace removal happens. It's broken down into 4 methods as well:

- `remove_whitespaces!`
- `is_inline?`
- `collapse_spaces!`
- `remove_trailing_space!`

The `remove_whitespaces!` method recursively walks the HTML tree (skipping `<pre>` tags). If a node has any children, they are chunked into groups of inline elements vs block elements. For each chunk of inline elements, it calls the `collapse_spaces!` and `remove_trailing_space!` methods. For each chunk of block elements, it calls `remove_whitespaces!` to keep walking the HTML tree recursively.

The `is_inline?` method determines whether a node is part of an inline context. A node is inline if and only if it's a text node, or it's an inline tag (but not `<br>`) and all its children are also inline.
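The inline test and the inline-vs-block chunking described above can be sketched in a few lines of plain Ruby. This is an illustration under stated assumptions rather than the PR's code: the `El` struct, the `INLINE_TAGS` list and the `inline?` and `chunk_children` helpers are all hypothetical names, and the real set of inline tags may well differ.

```ruby
# Hypothetical element type and inline-tag list, for illustration only.
El = Struct.new(:name, :children)
INLINE_TAGS = %w[a b i em strong span code].freeze

# A node is inline if it is a text node, or an inline tag other than <br>
# whose children are all inline too.
def inline?(node)
  return true if node.name == "text"
  return false if node.name == "br"
  INLINE_TAGS.include?(node.name) && node.children.all? { |child| inline?(child) }
end

# Group consecutive children into runs of inline vs block nodes.
def chunk_children(node)
  node.children
      .chunk { |child| inline?(child) }
      .map { |is_inline, nodes| [is_inline ? :inline : :block, nodes.map(&:name)] }
end

el = ->(name, *children) { El.new(name, children) }

div = el.call(
  "div",
  el.call("text"),
  el.call("b", el.call("text")),
  el.call("br"),
  el.call("p", el.call("text")),
  el.call("i", el.call("br")),
  el.call("text")
)

p chunk_children(div)
```

In this sketch the `i` element lands in a block chunk because it still contains a `<br>`, which is the situation that hoisting line breaks beforehand is meant to avoid.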
The `collapse_spaces!` method collapses any kind of (white) space into a single space (" ") character, even across tags. For example, if we have "` Foo \n Bar \t42`", it will return "`Foo Bar 42`".

Finally, the `remove_trailing_space!` method is there to remove any trailing space that might creep in at the end of the inline chunk.

This solution is not 100% bullet-proof. It does not support RTL languages at all and has some caveats that I felt were not worth the work to get properly fixed.

---

FIX: switched Nokogiri to Nokogumbo for better HTML5 parsing
FIX: better detection of hidden elements when converting HTML to Markdown
FIX: take into account the `allowed_href_schemes` site setting when converting HTML `<a>` to Markdown
FIX: added support for 'mailto:' scheme when converting `<a>` from HTML to Markdown
FIX: added support for `<img>` dimensions when converting from HTML to Markdown
FIX: added support for `<dl>`, `<dd>` and `<dt>` when converting from HTML to Markdown
FIX: added support for multiline emphases, strongs and strikes when converting from HTML to Markdown
FIX: added support for `<acronym>` when converting from HTML to Markdown
DEV: remove unused 'sanitize' gem

Wow, did you just read all that?! Congratz, here's a cookie: 🍪.
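For readers who want to see the whitespace collapsing from the commit message in action, here is a small string-only sketch. It follows the "quick & dirty" regex idea quoted in the message rather than the tree-walking solution the PR ended up with, and both helper names (`collapse_inline_whitespace`, `collapse_outside_pre`) are made up for this example.

```ruby
# Collapse every run of whitespace to a single space, then trim both ends.
def collapse_inline_whitespace(text)
  text.gsub(/[[:space:]]+/, " ").strip
end

# Same collapsing applied to a raw HTML string, leaving <pre>...</pre>
# blocks untouched so their whitespace survives.
def collapse_outside_pre(html)
  html.split(%r{(<pre\b.*?</pre>)}mi).map { |part|
    part.match?(/\A<pre\b/i) ? part : part.gsub(/[[:space:]]+/, " ")
  }.join
end

p collapse_inline_whitespace(" Foo \n Bar \t42")
p collapse_outside_pre("a  \n b<pre> x\n  y</pre>  c")
```

The first call reproduces the `Foo Bar 42` example from the commit message. The second shows the limitation the message mentions: a plain regex keeps a stray space next to the tags, which is what the chunk-based approach is designed to clean up.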

| Topic | Category | Replies | Views | Activity |
| --- | --- | --- | --- | --- |
| Any way to turn off code-block indent from emails | Support | 9 | 1468 | Apr 6, 2018 |
| Disable Code Formatting for inappropriate 4 space indent? | Feature | 10 | 1650 | Jan 17, 2020 |
| Email trimming improvement (no trimming in code blocks) | Feature, email | 6 | 186 | Nov 30, 2024 |
| Code blocks in emails have empty newlines | Support | 3 | 541 | Jun 21, 2022 |
| Email reply includes inline styles with "incoming email prefer html" site setting on | Support | 14 | 2129 | Apr 14, 2018 |


Related topics