Bidirectional characters in LTR languages post security fix

yaron · November 22, 2021, 7:33am

I just ran into the following PR:

github.com/discourse/discourse

SECURITY: Strip unrendered unicode bidirectional chars in code blocks

discourse:main ← discourse:issue/security-fix-CVE-2021-42574

opened 11:49PM - 21 Nov 21 UTC

martin-brennan

+130 -0

When rendering the markdown code blocks we replace the offending characters in the output string with spans highlighting a textual representation of the character, along with a title attribute with information about why the character was highlighted. The list of characters stripped by this fix, which are the bidirectional characters considered relevant, are: U+202A, U+202B, U+202C, U+202D, U+202E, U+2066, U+2067, U+2068, U+2069.

![image](https://user-images.githubusercontent.com/920448/142784052-56805e94-1592-498e-b787-e954c4d89550.png)
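
To make the PR description concrete, here is a minimal Python sketch of the replacement it describes. It is not the Discourse implementation: the function name, the `bidi-warning` class and the title text are assumptions made up for illustration, and only the list of nine code points comes from the PR text.

```python
import html

# The nine code points listed in the PR description.
BIDI_CHARS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")


def highlight_bidi(code: str) -> str:
    """Escape `code` for HTML and wrap each listed bidi character in a span."""
    out = []
    for ch in code:
        if ch in BIDI_CHARS:
            # Visible textual stand-in, e.g. <U+2067>, escaped for HTML.
            label = html.escape(f"<U+{ord(ch):04X}>")
            # Class name and title wording are hypothetical.
            out.append(
                '<span class="bidi-warning" '
                'title="Bidirectional control character">'
                f"{label}</span>"
            )
        else:
            out.append(html.escape(ch))
    return "".join(out)


print(highlight_bidi("/* Say hello; newline\u2067 /*/ return 0 ;"))
```

The invisible character never reaches the output; the reader sees an escaped `<U+2067>` marker in its place.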

I think it may render legitimate Hebrew or Arabic text unreadable.

One of the solutions I ran into was to disable the Unicode bidirectional algorithm and just display some representation of the non-printable characters (I think it was implemented in Pootle).
So basically the idea is to turn:
This‎‏ text

Into:
This<LRM><RLM> text

This way the user can decide whether this is malicious by seeing what the actual characters are, and can choose to re-enable the Unicode algorithm to read the text properly.
Thanks.
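
A rough sketch of that idea in Python, assuming the conventional abbreviations for the directional formatting characters. The function name `show_marks` is made up, and this is not how Pootle implements it.

```python
# Conventional abbreviations for Unicode directional formatting characters.
ABBREVIATIONS = {
    "\u200e": "LRM", "\u200f": "RLM",
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def show_marks(text: str) -> str:
    """Replace each invisible directional character with a visible <NAME> tag."""
    return "".join(
        f"<{ABBREVIATIONS[ch]}>" if ch in ABBREVIATIONS else ch
        for ch in text
    )


print(show_marks("This\u200e\u200f text"))  # This<LRM><RLM> text
```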

martin · November 22, 2021, 10:30pm

Thank you for raising this, we did think of this concern. The fix you linked in the OP only applies to unicode bidirectional characters in pre and code blocks, either manually written as HTML or generated from ``` markdown fenced code blocks, so it should not be an issue with regular Hebrew or Arabic text in a composed post.
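
The scoping can be illustrated with a small Python sketch that looks for the listed characters only inside `pre` and `code` elements of already-rendered HTML. The class and function names are hypothetical, and the real fix lives in Discourse's own markdown pipeline rather than anything like this.

```python
from html.parser import HTMLParser

# The code points named in the PR description.
BIDI_CHARS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")


class CodeBidiScanner(HTMLParser):
    """Collect bidi control characters that appear inside <pre> or <code>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0   # how many pre/code elements are currently open
        self.hits = []   # code points found inside them

    def handle_starttag(self, tag, attrs):
        if tag in ("pre", "code"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("pre", "code") and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.hits.extend(
                f"U+{ord(ch):04X}" for ch in data if ch in BIDI_CHARS
            )


def bidi_in_code(cooked_html: str) -> list:
    scanner = CodeBidiScanner()
    scanner.feed(cooked_html)
    scanner.close()
    return scanner.hits


post = "<p>\u202bregular text</p><pre><code>x\u2067y</code></pre>"
print(bidi_in_code(post))  # ['U+2067']
```

The U+202B in the paragraph is ignored; only the U+2067 inside the code block is reported.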

sam · November 22, 2021, 10:42pm

Demo:

#include <stdio.h>

int main() {
    /* Say hello; newline<U+2067> /*/ return 0 ;
    printf("Hello world.\n");
    return 0;
}

#include <stdio.h>

int main() {
    /* Say hello; newline<U+2067> /*/ return 0 ;
    printf("Hello world.\n");
    return 0;
}

Test: ‫"שלום חבר" - Hello Friend

Without BIDI

Test: “שלום חבר” - Hello Friend

Markdown:

Test: &#x202B;"שלום חבר" - Hello Friend

Without BIDI

Test: "שלום חבר" - Hello Friend

Not the best example in the world, but you should get the gist: this only impacts source code being posted on the forum. Putting bidi chars in source code is not something that is usually done.
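
For anyone who wants to check whether a snippet contains these characters before posting it, a minimal Python sketch follows. The function name is made up, and the character list is the one from the PR description.

```python
# The code points named in the PR description.
BIDI_CHARS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")


def find_bidi(source: str) -> list:
    """Return (line, column, code point) for every bidi control character."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CHARS:
                hits.append((lineno, col, f"U+{ord(ch):04X}"))
    return hits


# The comment line from the demo, with a real U+2067 after "newline".
demo = "\n".join([
    "int main() {",
    "    /* Say hello; newline\u2067 /*/ return 0 ;",
    '    printf("Hello world.");',
    "}",
])

for lineno, col, code_point in find_bidi(demo):
    print(f"line {lineno}, column {col}: {code_point}")
```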

yaron · November 23, 2021, 3:21pm

I’ll give another example, where omitting the RLM does break the sentence.

שלום לכולם ובמיוחד ל־Sam, Martin בחר לעזוב אותנו.

שלום לכולם ובמיוחד ל־Sam,‏ Martin בחר לעזוב אותנו.

Do you see the difference?
The only change there is the RLM. I wanted to congratulate Sam and announce that Martin is leaving (no offense).
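
One way to see why a single invisible character matters is to look at the bidirectional classes Unicode assigns to the characters around the comma. The Python sketch below uses the standard `unicodedata` module; the helper name is made up. As far as I understand the bidi algorithm, a comma and space sitting between two Latin words are pulled into the same left-to-right run, while an RLM (class `R`) after the comma gives them a strong right-to-left neighbour, so the two names are laid out as separate runs.

```python
import unicodedata


def bidi_classes(text: str) -> list:
    """Pair each character's code point with its Unicode bidi class."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


without_rlm = "Sam, Martin"
with_rlm = "Sam,\u200f Martin"

# Characters between the two names in each variant.
print(bidi_classes(without_rlm[3:5]))
# [('U+002C', 'CS'), ('U+0020', 'WS')]
print(bidi_classes(with_rlm[3:6]))
# [('U+002C', 'CS'), ('U+200F', 'R'), ('U+0020', 'WS')]
```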

sam · November 23, 2021, 8:16pm

Yes, that example is certainly much better! As you can see, it continues to work and is not impacted by the security fix.

yaron · November 24, 2021, 1:27pm

Hmmm, it’s not in a code block.
I meant that inside a code block it won’t appear as expected (this is what the fix is all about, am I right?).

sam · November 24, 2021, 9:39pm

Yeah but why would you include it in a code block?

yaron · November 24, 2021, 9:56pm

Excerpts from gettext files or native Hebrew/Arabic strings; there are such cases.

sam · November 25, 2021, 1:47am

I would say the outlier case here has workarounds (screenshots, attachment uploads and so on). Also, it is pretty clear that the special char is in place.

The risk of https://trojansource.codes/ is higher than the risk of mild disruption in extreme outlier cases.

yaron · November 25, 2021, 5:47am

But my suggestion breaks up the sentence with a visible cue. Replacing the RLM and LRM with <RLM> or <LRM> shows the user that there were additional characters and that the text is now rendered without them. It also signals that this might break the reading experience, and it leaves the option to put the characters back manually if needed. Removing the characters completely, without any indicator, leaves no room for educated decisions.

And it will also prevent trojan source code as you mentioned because the user will be able to see the malicious code with the indicators.

I will try to get some screenshots from Pootle. I don’t remember seeing that raw strings option in the past couple of years, but it was very useful when we started fixing the LibreOffice localization.

sam · November 25, 2021, 6:10am

Not following. We do not strip, we replace; see my example above.

yaron · November 28, 2021, 4:04pm

I understand. Wouldn’t it be better to use their names instead of the Unicode code point?
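
The official names are easy to look up, for example with Python's standard `unicodedata` module. This is only a sketch of how a label or tooltip could be built; the `describe` helper is made up.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point followed by the official Unicode character name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"


# The characters covered by the fix.
for ch in "\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069":
    print(describe(ch))
```

For U+2067 this gives `U+2067 RIGHT-TO-LEFT ISOLATE`.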

sam · November 28, 2021, 9:05pm

If repeated confusion is reported in the wild, we can certainly fine-tune.
