In [1]: import urllib
In [2]: u = urllib.request.urlopen('https://meta.discourse.org/posts/1418409/raw')
In [3]: r = u.read(); r
Out[3]: b'As described here (https://meta.discourse.org/t/embed-discourse-comments-on-another-website-via-javascript/31963/453), when embedding Discourse into my website, the title is correctly parsed. But as it contains umlauts, titles like \\xe2\\x80\\x9cIch w\\xc3\\xbcrde\\xe2\\x80\\x9d end up in \\xe2\\x80\\x9cIch w\\xc3\\x83\\xc2\\xbcrde\\xe2\\x80\\x9d.\\n\\nIs this a general problem, a problem with my page or any workaround for that? Thanks!'
In [4]: u.headers['content-type']
Out[4]: 'text/plain; charset=utf-8'
In [5]: r.decode('utf-8')
Out[5]: 'As described here (https://meta.discourse.org/t/embed-discourse-comments-on-another-website-via-javascript/31963/453), when embedding Discourse into my website, the title is correctly parsed. But as it contains umlauts, titles like “Ich würde” end up in “Ich würde”.\\n\\nIs this a general problem, a problem with my page or any workaround for that? Thanks!'
In [6]: print(r.decode('utf-8'))
As described here (https://meta.discourse.org/t/embed-discourse-comments-on-another-website-via-javascript/31963/453), when embedding Discourse into my website, the title is correctly parsed. But as it contains umlauts, titles like “Ich würde” end up in “Ich würde”.
Is this a general problem, a problem with my page or any workaround for that? Thanks!
In [7]: snippet = r[220:255]; snippet
Out[7]: b'titles like \\xe2\\x80\\x9cIch w\\xc3\\xbcrde\\xe2\\x80\\x9d end up'
In [8]: snippet.decode('utf-8')
Out[8]: 'titles like “Ich würde” end up'
In [9]: snippet.decode('iso-8859-1')
Out[9]: 'titles like â\x80\x9cIch würdeâ\x80\x9d end up'