Problem with umlauts when embedding Discourse on another website

As described here (Embed Discourse comments on another website via Javascript - #453 by limetti), when embedding Discourse into my website, the title is correctly parsed. But as it contains umlauts, titles like “Ich würde” end up in “Ich wÃ¼rde”.

Is this a general problem, a problem with my page or any workaround for that? Thanks!

This is a classic “wrong codec” problem.

As a test case, if we read (via Python, in this example) the raw data from your post:

In [1]: import urllib.request

In [2]: u = urllib.request.urlopen('https://meta.discourse.org/posts/1418409/raw')

In [3]: r = u.read(); r
Out[3]: b'As described here (https://meta.discourse.org/t/embed-discourse-comments-on-another-website-via-javascript/31963/453), when embedding Discourse into my website, the title is correctly parsed. But as it contains umlauts, titles like \xe2\x80\x9cIch w\xc3\xbcrde\xe2\x80\x9d end up in \xe2\x80\x9cIch w\xc3\x83\xc2\xbcrde\xe2\x80\x9d.\n\nIs this a general problem, a problem with my page or any workaround for that? Thanks!'

We get bytes, but don’t yet know how to decode them. However, one of the response headers tells us we should use UTF-8:

In [4]: u.headers['content-type']
Out[4]: 'text/plain; charset=utf-8'

In [5]: r.decode('utf-8')
Out[5]: 'As described here (https://meta.discourse.org/t/embed-discourse-comments-on-another-website-via-javascript/31963/453), when embedding Discourse into my website, the title is correctly parsed. But as it contains umlauts, titles like “Ich würde” end up in “Ich wÃ¼rde”.\n\nIs this a general problem, a problem with my page or any workaround for that? Thanks!'

In [6]: print(r.decode('utf-8'))
As described here (https://meta.discourse.org/t/embed-discourse-comments-on-another-website-via-javascript/31963/453), when embedding Discourse into my website, the title is correctly parsed. But as it contains umlauts, titles like “Ich würde” end up in “Ich wÃ¼rde”.

Is this a general problem, a problem with my page or any workaround for that? Thanks!

You’ll note the characters look exactly as you posted. But when those bytes are interpreted the wrong way, in particular the common mistake of decoding them as ISO-8859-1 instead of UTF-8 (string shortened for clarity below), you get:

In [7]: snippet = r[220:255]; snippet
Out[7]: b'titles like \xe2\x80\x9cIch w\xc3\xbcrde\xe2\x80\x9d end up'

In [8]: snippet.decode('utf-8')
Out[8]: 'titles like “Ich würde” end up'

In [9]: snippet.decode('iso-8859-1')
Out[9]: 'titles like â\x80\x9cIch wÃ¼rdeâ\x80\x9d end up'

If I print that, my terminal hangs. Wild. :smiley:
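
That Ã¼ is also why the raw bytes at the top contain \xc3\x83\xc2\xbc for the second “würde”: the ü was UTF-8-encoded, wrongly decoded as ISO-8859-1, and then re-encoded as UTF-8 when the post was submitted. A minimal round-trip sketch (plain Python, not from the session above):

# 'ü' encoded as UTF-8 and then wrongly decoded as ISO-8859-1 yields the
# mojibake shown in the original post.
mojibake = 'ü'.encode('utf-8').decode('iso-8859-1')
print(mojibake)                  # Ã¼

# Re-encoding that mojibake as UTF-8 reproduces the \xc3\x83\xc2\xbc sequence
# seen in the raw bytes of the post above.
print(mojibake.encode('utf-8'))  # b'\xc3\x83\xc2\xbc'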

To sum up: whatever you’re using to pull the post data out of Discourse is treating it as ISO-8859-1 instead of UTF-8.

(Speculating) Perhaps you’re embedding the raw bytes pulled from a Discourse site into a page that is being served with a charset of ISO-8859-1.
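
For reference, a minimal sketch of charset-aware fetching (same URL as the session above; the fallback to UTF-8 when no charset is declared is an assumption):

import urllib.request

# Fetch the raw post and decode it with the charset the server declares,
# rather than assuming a default such as ISO-8859-1.
with urllib.request.urlopen('https://meta.discourse.org/posts/1418409/raw') as resp:
    charset = resp.headers.get_content_charset() or 'utf-8'  # assumed fallback
    text = resp.read().decode(charset)

print(text)  # umlauts and curly quotes come through intact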


Thanks a lot for the hint. Indeed, the UTF-8 meta tag was placed after the title tag :wink:

Works now!
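
As a follow-up sketch of that fix (the URL is a placeholder, not the poster’s actual site): the charset declaration should appear before the title and within the first 1024 bytes of the page, which a quick check like this can confirm.

import urllib.request

# Check that the charset declaration appears within the first 1024 bytes and
# before <title>, so the browser never has to guess while decoding the title.
raw = urllib.request.urlopen('https://example.com/').read()
head = raw[:1024].decode('ascii', errors='replace').lower()

charset_pos = head.find('charset')
title_pos = head.find('<title')

print('charset declared in first 1024 bytes:', charset_pos != -1)
print('charset declared before <title>:',
      charset_pos != -1 and (title_pos == -1 or charset_pos < title_pos))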

