Bidirectional characters in LTR languages post security fix

I just ran into the following PR:

I think it may render legitimate Hebrew or Arabic text unreadable.

One of the solutions I ran into was disabling the Unicode bidi algorithm and just displaying some representation of the non-printable characters (I think it was implemented in Pootle).
So basically the idea is to turn:
This‎‏ text

Into:
This<LRM><RLM> text

This way the user can decide whether this is malicious by seeing what the actual characters are, and can optionally enable the Unicode bidi algorithm to read the text properly.
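To make the idea concrete, here is a minimal sketch in Python of that kind of replacement. It is not how Pootle or Discourse do it; the function name `show_bidi_controls` and the choice of which characters to cover are my own assumptions. The short names are the standard Unicode abbreviations for these controls.

```python
# Hypothetical mapping (an assumption for illustration): bidi control
# characters and their standard Unicode abbreviations.
BIDI_NAMES = {
    "\u061c": "ALM",  # Arabic letter mark
    "\u200e": "LRM",  # left-to-right mark
    "\u200f": "RLM",  # right-to-left mark
    "\u202a": "LRE",  # left-to-right embedding
    "\u202b": "RLE",  # right-to-left embedding
    "\u202c": "PDF",  # pop directional formatting
    "\u202d": "LRO",  # left-to-right override
    "\u202e": "RLO",  # right-to-left override
    "\u2066": "LRI",  # left-to-right isolate
    "\u2067": "RLI",  # right-to-left isolate
    "\u2068": "FSI",  # first strong isolate
    "\u2069": "PDI",  # pop directional isolate
}


def show_bidi_controls(text: str) -> str:
    """Replace each invisible bidi control with a visible <NAME> marker."""
    return "".join(
        f"<{BIDI_NAMES[ch]}>" if ch in BIDI_NAMES else ch for ch in text
    )


print(show_bidi_controls("This\u200e\u200f text"))  # prints: This<LRM><RLM> text
```

Text without any of these characters passes through unchanged, so the markers only show up where something invisible was actually present.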
Thanks.

3 Likes

Thank you for raising this; we did consider this concern. The fix you linked in the OP only applies to Unicode bidirectional characters in pre and code blocks, either manually written as HTML or generated from ``` markdown fenced code blocks, so it should not be an issue with regular Hebrew or Arabic text in a composed post.
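As a rough illustration of that scoping (not the actual Discourse implementation), the sketch below escapes bidi controls only inside ``` fenced blocks and leaves the rest of the post alone. The function name, the regexes, and the exact set of characters covered are assumptions; it also ignores pre/code blocks written as raw HTML.

```python
import re

# Assumed character set: the embedding, override, and isolate controls.
BIDI_CONTROLS = re.compile("[\u202a-\u202e\u2066-\u2069]")

# A fenced block: a line starting with ``` up to the next line that is only ```.
FENCE = re.compile(r"^```.*?^```[ \t]*$", re.MULTILINE | re.DOTALL)


def escape_bidi_in_fences(markdown: str) -> str:
    """Show bidi controls as <U+XXXX>, but only inside fenced code blocks."""

    def escape_block(match: re.Match) -> str:
        return BIDI_CONTROLS.sub(
            lambda m: f"<U+{ord(m.group()):04X}>", match.group()
        )

    return FENCE.sub(escape_block, markdown)


post = "```\n/* Say hello; newline\u2067 /*/ return 0 ;\n```\nTest: \u202bregular text\n"
result = escape_bidi_in_fences(post)
print("<U+2067>" in result)  # the control inside the fence became visible
print("\u202b" in result)    # the control in regular text is untouched
```

Both checks print `True`: only the character inside the fence is rewritten.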

2 Likes

Demo:

#include <stdio.h>

int main() {
    /* Say hello; newline<U+2067> /*/ return 0 ;
    printf("Hello world.\n");
    return 0;
}

Test: ‫"שלום חבר" - Hello Friend

Without BIDI

Test: “שלום חבר” - Hello Friend

Markdown:

Test: &#x202B;"שלום חבר" - Hello Friend

Without BIDI

Test: "שלום חבר" - Hello Friend

Not the best example in the world, but you should get the gist: the fix only affects source code posted on the forum. Bidi characters in source code are unusual.

5 Likes

I'll give another example where the lack of an RLM does break the sentence.

שלום לכולם ובמיוחד ל־Sam, Martin בחר לעזוב אותנו.

שלום לכולם ובמיוחד ל־Sam,‏ Martin בחר לעזוב אותנו.

Do you see the difference?
The only change there is the RLM. I wanted to congratulate Sam and let everyone know that Martin is leaving (no offense).
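Since the two lines look almost the same when pasted, one way to confirm what differs is to list the invisible format characters in each. This is just a small Python sketch; the function name `describe_invisibles` is made up, and the strings are only the "Sam, Martin" fragment of the sentences above.

```python
import unicodedata


def describe_invisibles(text: str) -> list:
    """Return (index, code point, name) for every format character (category Cf)."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


without_rlm = "Sam, Martin"
with_rlm = "Sam,\u200f Martin"

print(describe_invisibles(without_rlm))  # prints: []
print(describe_invisibles(with_rlm))     # prints: [(4, 'U+200F', 'RIGHT-TO-LEFT MARK')]
```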

2 Likes

Yes, that example is certainly much better! As you can see it continues to work and is not impacted by the security fix :tada:

4 Likes

Hmmm it's not a codeblock :slight_smile:
I meant that inside a codeblock it won't appear as expected (This is what the fix is all about, am I right?)

1 Like

Yeah but why would you include it in a code block?

2 Likes

Excerpts from gettext files with native Hebrew/Arabic strings; there are such cases.

2 Likes

I would say the outlier case here has workarounds (screenshots, attachment uploads, and so on). Also, it is pretty clear that the special char is in place.

The risk of https://trojansource.codes/ is higher than the risk of mild disruption in extreme outlier cases.

3 Likes

But my suggestion marks the break in the sentence with a visible cue. Replacing RLM and LRM with <RLM> or <LRM> shows the user that there were additional characters and that the text is now rendered without them. It also tells them that this might break the experience and that there's an option to put the characters back manually if needed. Removing the characters completely, without any indicator, leaves no room for educated decisions.

And it will also prevent Trojan Source attacks, as you mentioned, because the user will be able to see the malicious code with the indicators.

I will try to get some screenshots from Pootle. I don't remember seeing that raw strings option in the past couple of years; it was very useful when we started fixing the LibreOffice localization.

2 Likes

Not following. We do not strip, we replace; see my example above.

3 Likes

I understand. Wouldn't it be better to use their names instead of the Unicode entity?

1 Like

If repeated confusion is reported in the wild, we can certainly fine-tune.

3 Likes