Problem with umlauts when embedding Discourse on another website

As described here (Embed Discourse comments on another website via Javascript - #453 by limetti), when embedding Discourse into my website, the title is correctly parsed. But as it contains umlauts, titles like “Ich würde” end up in “Ich wÃ¼rde”.

Is this a general problem, a problem with my page or any workaround for that? Thanks!

This is a classic “wrong codec” problem.

As a test case, if we read (via Python, in this example) the raw data from your post:

In [1]: import urllib.request

In [2]: u = urllib.request.urlopen('https://meta.discourse.org/posts/1418409/raw')

In [3]: r = u.read(); r
Out[3]: b'As described here (https://meta.discourse.org/t/embed-discourse-comments-on-another-website-via-javascript/31963/453), when embedding Discourse into my website, the title is correctly parsed. But as it contains umlauts, titles like \xe2\x80\x9cIch w\xc3\xbcrde\xe2\x80\x9d end up in \xe2\x80\x9cIch w\xc3\x83\xc2\xbcrde\xe2\x80\x9d.\n\nIs this a general problem, a problem with my page or any workaround for that? Thanks!'

We get bytes, but don’t yet know how to decode them. However, one of the response headers tells us we should use UTF-8:

In [4]: u.headers['content-type']
Out[4]: 'text/plain; charset=utf-8'

In [5]: r.decode('utf-8')
Out[5]: 'As described here (https://meta.discourse.org/t/embed-discourse-comments-on-another-website-via-javascript/31963/453), when embedding Discourse into my website, the title is correctly parsed. But as it contains umlauts, titles like “Ich würde” end up in “Ich wÃ¼rde”.\n\nIs this a general problem, a problem with my page or any workaround for that? Thanks!'

In [6]: print(r.decode('utf-8'))
As described here (https://meta.discourse.org/t/embed-discourse-comments-on-another-website-via-javascript/31963/453), when embedding Discourse into my website, the title is correctly parsed. But as it contains umlauts, titles like “Ich würde” end up in “Ich wÃ¼rde”.

Is this a general problem, a problem with my page or any workaround for that? Thanks!

You’ll note the characters look exactly as you posted. But when those bytes are interpreted the wrong way, in particular the common mistake of decoding them as ISO-8859-1 instead of UTF-8 (string shortened for clarity below), you get:

In [7]: snippet = r[220:255]; snippet
Out[7]: b'titles like \xe2\x80\x9cIch w\xc3\xbcrde\xe2\x80\x9d end up'

In [8]: snippet.decode('utf-8')
Out[8]: 'titles like “Ich würde” end up'

In [9]: snippet.decode('iso-8859-1')
Out[9]: 'titles like â\x80\x9cIch wÃ¼rdeâ\x80\x9d end up'

If I print that, my terminal hangs. Wild. :smiley:
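
That Ã¼ is also why the raw bytes at the top contain \xc3\x83\xc2\xbc for the second “würde”: the ü was UTF-8-encoded, wrongly decoded as ISO-8859-1, and then re-encoded as UTF-8 when the post was submitted. A minimal round-trip sketch (plain Python, not from the session above):

# 'ü' encoded as UTF-8 and then wrongly decoded as ISO-8859-1 yields the
# mojibake shown in the original post.
mojibake = 'ü'.encode('utf-8').decode('iso-8859-1')
print(mojibake)                  # Ã¼

# Re-encoding that mojibake as UTF-8 reproduces the \xc3\x83\xc2\xbc sequence
# seen in the raw bytes of the post above.
print(mojibake.encode('utf-8'))  # b'\xc3\x83\xc2\xbc'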

To sum up: whatever you’re using to pull the post data out of Discourse is treating it as ISO-8859-1 instead of UTF-8.

(Speculating) Perhaps you’re embedding the raw bytes pulled from a Discourse site into a page that is being served with a charset of ISO-8859-1.
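
For reference, a minimal sketch of charset-aware fetching (same URL as the session above; the fallback to UTF-8 when no charset is declared is an assumption):

import urllib.request

# Fetch the raw post and decode it with the charset the server declares,
# rather than assuming a default such as ISO-8859-1.
with urllib.request.urlopen('https://meta.discourse.org/posts/1418409/raw') as resp:
    charset = resp.headers.get_content_charset() or 'utf-8'  # assumed fallback
    text = resp.read().decode(charset)

print(text)  # umlauts and curly quotes come through intact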


Thanks a lot for the hint. Indeed, the UTF-8 meta tag was placed after the title tag :wink:

Works now!
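
As a follow-up sketch of that fix (the URL is a placeholder, not the poster’s actual site): the charset declaration should appear before the title and within the first 1024 bytes of the page, which a quick check like this can confirm.

import urllib.request

# Check that the charset declaration appears within the first 1024 bytes and
# before <title>, so the browser never has to guess while decoding the title.
raw = urllib.request.urlopen('https://example.com/').read()
head = raw[:1024].decode('ascii', errors='replace').lower()

charset_pos = head.find('charset')
title_pos = head.find('<title')

print('charset declared in first 1024 bytes:', charset_pos != -1)
print('charset declared before <title>:',
      charset_pos != -1 and (title_pos == -1 or charset_pos < title_pos))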

