Certain Unicode entities are being escaped

(Sam Saffron) #1

This is a test ℵ test

works in markdown preview …

'Approximately equals' HTML character bug
&Colon ; gets reverted when posted
(Sam Saffron) #2

This is a bug in nokogiri, being tracked here:


It's actually a bit of a nightmare: the root of the issue is in libxml, and I raised it on the mailing list there.

(lid) #3

This is a test

ℵ works


Well, it is easy to fix using JavaScript as a content post-processor:

$("p").each(function (a, elm) {
  var $elm = $(elm);
  // Turn an escaped "&amp;name;" back into the entity "&name;"
  $elm.html($elm.html().replace(/(&amp;)([A-Za-z0-9]+;)/g, "&$2"));
});

Or as a pre-processor, by converting entities that libxml does not support into their numeric representation:

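function replace_named_ent_with_code_ent(a, b, c) {
  // a = whole match, b = "&", c = entity name
  // Entity name => code point; only the entry used in the example below is shown.
  var extend_ent = { alefsym: 8501 };
  if (extend_ent.hasOwnProperty(c)) {
    return "&#" + extend_ent[c] + ";";
  } else {
    return a;
  }
}
var nt = "&alefsym;".replace(/(&)([A-Za-z0-9]+);/g, replace_named_ent_with_code_ent);
// nt => "&#8501;"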

The problem with such a fix is how it will distinguish between an entity that needs a fix and one that doesn’t. Better to wait for libxml to support it.
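
To make that ambiguity concrete, here is a minimal sketch (not Discourse's code) comparing a blanket un-escape with one limited to an explicit allowlist. The names unescapeAll, unescapeListed and BROKEN_NAMES are hypothetical, and which entity names actually need the fix is an assumption; only alefsym and Colon come from this thread.

```javascript
// Hypothetical allowlist of entity names assumed to be wrongly escaped
// by the HTML pipeline. Only these two are mentioned in the thread.
const BROKEN_NAMES = new Set(["alefsym", "Colon"]);

// Blanket fix: turns every "&amp;name;" back into "&name;".
function unescapeAll(html) {
  return html.replace(/&amp;([A-Za-z][A-Za-z0-9]*;)/g, "&$1");
}

// Narrow fix: only un-escapes names present in the allowlist.
function unescapeListed(html, names = BROKEN_NAMES) {
  return html.replace(/&amp;([A-Za-z][A-Za-z0-9]*);/g, (match, name) =>
    names.has(name) ? `&${name};` : match
  );
}

// A post where the author deliberately wrote the literal text "&lt;"
// (stored as "&amp;lt;") next to a wrongly escaped "&amp;alefsym;".
const baked = "<p>Write &amp;lt; for a less-than sign, &amp;alefsym; for aleph</p>";

console.log(unescapeAll(baked));    // the intended literal "&lt;" becomes a real "<"
console.log(unescapeListed(baked)); // "&amp;lt;" is left alone, alefsym is restored
```

Even the allowlist version cannot tell a wrongly escaped alefsym from an author who really wanted the literal text, which is why a fix at the libxml level is the cleaner option.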


The real problem is that the preview and baked post are different.

(lid) #5

I have pushed a PR that adds support for Unicode entities.
It uses Dialect to replace named entities with numeric entities, so Nokogiri / libxml processing will not break the encoding.




(Sam Saffron) #6

Appreciate the attempt, but we can not accept a patch at this level.

This needs to be fixed in nokogiri or libxml2

(Jeff Atwood) #7

(Sam Saffron) #8

FYI, I just posted a question to libxml2 on their ancient mailing list.

If they reply to me I will reopen.

(Jeff Atwood) #9

Yeah I just figure this is not our bug…

(Sam Saffron) #10

Another update … I managed … somehow get this posted on the libxml mailing list


No response, but still progress