Certain unicode entities are being escaped


(Sam Saffron) #1

This is a test ℵ test

works in markdown preview …


(Sam Saffron) #2

This is a bug in nokogiri, being tracked here:

https://github.com/sparklemotion/nokogiri/issues/1173

It's actually a bit of a nightmare; the root of the issue is in libxml, and I raised it on the mailing list there.


(lid) #3

This is a test

ℵ works

:heart:


Well, it is easy to fix using JavaScript as a content post-processor:

$("p").each(function (i, elm) {
  var $elm = $(elm);
  // Undo the double escaping: turn "&amp;name;" back into "&name;".
  $elm.html($elm.html().replace(/&amp;([A-Za-z]+;)/g, "&$1"));
});

Or as a pre-processor, by converting entities that libxml does not support into their numeric representation:
http://jsfiddle.net/lid0/0hctyy8e/

function replace_named_ent_with_code_ent(match, amp, name) {
  // Named entities that libxml does not support, mapped to decimal code points.
  var extend_ent = {
    "aleph": "8501",
    "infin": "8734",
    "otimes": "8855",
    "radic": "8730"
  };
  if (extend_ent.hasOwnProperty(name)) {
    return "&#" + extend_ent[name] + ";";
  } else {
    return match;
  }
}
var nt = "&aleph;".replace(/(&)([A-Za-z]+);/g, replace_named_ent_with_code_ent);
// nt => "&#8501;"

The problem with such a fix is how to distinguish between an entity that needs fixing and one that doesn't. Better to wait for libxml to support it.
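
One way to narrow that down, as a rough sketch rather than a tested fix: only rewrite names listed in an explicit table, so entities the parser already handles (such as &amp; or &hearts;) and existing numeric references pass through unchanged. The function name toNumericEntities and the table contents are assumptions for illustration, not part of Discourse.

```javascript
// Hypothetical table of named entities that need rewriting.
// Values are decimal Unicode code points.
var UNSUPPORTED_ENTITIES = {
  aleph: 8501,
  infin: 8734,
  otimes: 8855,
  radic: 8730
};

// Rewrite only the names listed in the table; leave every other entity alone.
function toNumericEntities(text) {
  return text.replace(/&([A-Za-z][A-Za-z0-9]*);/g, function (match, name) {
    if (Object.prototype.hasOwnProperty.call(UNSUPPORTED_ENTITIES, name)) {
      return "&#" + UNSUPPORTED_ENTITIES[name] + ";";
    }
    return match;
  });
}

// Mixed input: one listed name, two unlisted names, one numeric reference.
var sample = "&aleph; &amp; &hearts; &#8734;";
var converted = toNumericEntities(sample);
```

Even with the table, this cannot tell whether the author meant the literal text of an entity (for example inside a code sample), which is the ambiguity described above.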


#4

The real problem is that the preview and baked post are different.


(lid) #5

I have pushed a PR that adds support for Unicode entities.
It uses a Dialect to replace named entities with numeric entities, so Nokogiri/libxml processing will not break the encoding.
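
To illustrate why the numeric form is a safe substitute (this is not the code from the PR, just a minimal sketch with a hypothetical helper name, decodeNumericEntities): a numeric reference encodes the code point directly, so decoding it needs no table of names.

```javascript
// Decode decimal (&#8501;) and hexadecimal (&#x2135;) character references.
function decodeNumericEntities(text) {
  return text.replace(/&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));/g, function (match, hex, dec) {
    var codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave out-of-range values untouched instead of throwing.
    if (codePoint > 0x10FFFF) {
      return match;
    }
    return String.fromCodePoint(codePoint);
  });
}

// Both references name the same character, U+2135 (the aleph symbol).
var decoded = decodeNumericEntities("&#8501; &#x2135;");
```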

Input:

output:

https://github.com/discourse/discourse/pull/2846/commits


(Sam Saffron) #6

Appreciate the attempt, but we cannot accept a patch at this level.

This needs to be fixed in nokogiri or libxml2.


(Jeff Atwood) #7

(Sam Saffron) #8

FYI, I just posted a question to the libxml2 project on its ancient mailing list.

If they reply to me, I will reopen this.


(Jeff Atwood) #9

Yeah I just figure this is not our bug…


(Sam Saffron) #10

Another update … I somehow managed to get this posted on the libxml mailing list:

https://mail.gnome.org/archives/xml/2015-August/msg00000.html

No response, but still progress