I think it may render legitimate Hebrew or Arabic text unreadable.
One of the solutions I ran into was disabling the Unicode algorithms and just displaying some representation of the non-printable characters (I think it was implemented in Pootle).
So basically the idea is to turn:
This‎‏ text
Into:
This<LRM><RLM> text
This way the user can decide whether this is malicious or not by seeing what the actual characters are, and can choose to re-enable the Unicode algorithms to read the text properly.
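To make the idea concrete, here is a minimal sketch in Python of that substitution. It is not Pootle's code; the function name `show_bidi_controls` and the exact set of characters covered are my own assumptions, and the labels are simply the standard Unicode abbreviations for each control character.

```python
# Hypothetical mapping from invisible bidi control characters to visible
# markers. The set below is an assumption, not Pootle's actual list.
BIDI_MARKERS = {
    "\u061c": "<ALM>",  # Arabic letter mark
    "\u200e": "<LRM>",  # left-to-right mark
    "\u200f": "<RLM>",  # right-to-left mark
    "\u202a": "<LRE>",  # left-to-right embedding
    "\u202b": "<RLE>",  # right-to-left embedding
    "\u202c": "<PDF>",  # pop directional formatting
    "\u202d": "<LRO>",  # left-to-right override
    "\u202e": "<RLO>",  # right-to-left override
    "\u2066": "<LRI>",  # left-to-right isolate
    "\u2067": "<RLI>",  # right-to-left isolate
    "\u2068": "<FSI>",  # first strong isolate
    "\u2069": "<PDI>",  # pop directional isolate
}


def show_bidi_controls(text: str) -> str:
    """Replace each bidi control character with a visible marker.

    All other characters, including Hebrew and Arabic letters, are
    left untouched.
    """
    return "".join(BIDI_MARKERS.get(ch, ch) for ch in text)


print(show_bidi_controls("This\u200e\u200f text"))
```

Since the mapping is one-to-one, the original characters could be restored from the markers if the user decides the text is legitimate.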
Thanks.
Thank you for raising this; we did consider this concern. The fix you linked in the OP only applies to Unicode bidirectional characters in pre and code blocks, either manually written as HTML or generated from ``` markdown fenced code blocks, so it should not be an issue with regular Hebrew or Arabic text in a composed post.
Not the best example in the world, but you should get the gist: this only impacts source code being posted on the forum. Putting bidi characters in source code is not something that is usually done.
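As a rough illustration of that scoping (not the forum's actual implementation), the sketch below marks bidi control characters only on lines inside fenced code blocks and leaves regular prose alone. The function name `mark_bidi_in_fences`, the `<U+XXXX>` marker format, and the simplified fence detection are all assumptions made for the example.

```python
import re

# Bidi control characters to flag; this set is an assumption.
BIDI_CONTROLS = re.compile("[\u061c\u200e\u200f\u202a-\u202e\u2066-\u2069]")

# Three backticks, built this way so the example itself can sit in a fence.
FENCE = "`" * 3


def mark_bidi_in_fences(markdown: str) -> str:
    """Make bidi controls visible inside fenced code blocks only.

    Simplified: any line starting with three backticks toggles the
    "inside a code block" state; nested or indented-code cases are ignored.
    """
    out = []
    in_fence = False
    for line in markdown.split("\n"):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            out.append(line)
        elif in_fence:
            # Replace each control character with its code point, e.g. <U+202E>.
            out.append(
                BIDI_CONTROLS.sub(lambda m: f"<U+{ord(m.group()):04X}>", line)
            )
        else:
            # Regular prose (Hebrew, Arabic, or otherwise) is left as-is.
            out.append(line)
    return "\n".join(out)
```

With this scoping, a right-to-left mark in an ordinary sentence passes through unchanged, while the same character inside a code fence becomes visible.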
But my suggestion breaks up the sentence with a visible cue: replacing the RLM and LRM with <RLM> or <LRM> shows the user that there were some additional characters and that the text is now rendered without them, while informing them that this might break the experience and that there's an option to put them back manually if needed. Removing the characters completely without any indicator leaves no room for educated decisions.
It will also prevent trojan source code, as you mentioned, because the user will be able to see the malicious code with the indicators.
I will try to get some screenshots from Pootle. I don't remember seeing that raw strings option in the past couple of years; it was very useful when we started fixing the LibreOffice localization.