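As described here (Embed Discourse comments on another website via Javascript - #453 by limetti), when embedding Discourse into my website, the title is parsed correctly. But because it contains umlauts, titles like “Ich würde” end up as “Ich würde”.
Is this a general problem, a problem with my page, or is there a workaround? Thanks!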
This is a classic “wrong encoding” problem.
As a test case, if we read the raw data of your post via Python (in this example):
In [1]: import urllib.request
In [2]: u = urllib.request.urlopen('https://meta.discourse.org/posts/1418409/raw')
In [3]: r = u.read(); r
Out[3]: b'As described here (https://meta.discourse.org/t/embed-discourse-comments-on-another-website-via-javascript/31963/453), when embedding Discourse into my website, the title is correctly parsed. But as it contains umlauts, titles like \xe2\x80\x9cIch w\xc3\xbcrde\xe2\x80\x9d end up in \xe2\x80\x9cIch w\xc3\x83\xc2\xbcrde\xe2\x80\x9d.\n\nIs this a general problem, a problem with my page or any workaround for that? Thanks!'
We get bytes, with no indication of how to decode them. One of the response headers, however, tells us to use UTF-8:
In [4]: u.headers['content-type']
Out[4]: 'text/plain; charset=utf-8'
In [5]: r.decode('utf-8')
Out[5]: 'As described here (https://meta.discourse.org/t/embed-discourse-comments-on-another-website-via-javascript/31963/453), when embedding Discourse into my website, the title is correctly parsed. But as it contains umlauts, titles like “Ich würde” end up in “Ich würde”.\n\nIs this a general problem, a problem with my page or any workaround for that? Thanks!'
In [6]: print(r.decode('utf-8'))
As described here (https://meta.discourse.org/t/embed-discourse-comments-on-another-website-via-javascript/31963/453), when embedding Discourse into my website, the title is correctly parsed. But as it contains umlauts, titles like “Ich würde” end up in “Ich würde”.
Is this a general problem, a problem with my page or any workaround for that? Thanks!
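You'll notice the characters are exactly as you posted them. But when those bytes are interpreted incorrectly, most commonly as ISO-8859-1 instead of UTF-8, you get (string shortened for clarity):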
In [7]: snippet = r[220:255]; snippet
Out[7]: b'titles like \xe2\x80\x9cIch w\xc3\xbcrde\xe2\x80\x9d end up'
In [8]: snippet.decode('utf-8')
Out[8]: 'titles like “Ich würde” end up'
In [9]: snippet.decode('iso-8859-1')
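Out[9]: 'titles like â\x80\x9cIch würdeâ\x80\x9d end up'
If I print that, my terminal hangs. Amazing.
To summarize: whatever you are using to pull post data out of Discourse is treating it as ISO-8859-1 rather than UTF-8.
(Speculation) Perhaps you are embedding the raw bytes pulled from the Discourse site into a page that is served with the ISO-8859-1 code page.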
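As a rough sketch of that diagnosis, the following reproduces the garbling step by step and then reverses it. The variable names are illustrative, and the repair shown is only an assumption-laden trick: it works when every garbled character fits in ISO-8859-1, and it is no substitute for fixing the declared encoding at the source.

```python
# Illustrative string taken from the question; all variable names are hypothetical.
original = "Ich würde"

# Step 1: the text is correctly stored as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# Step 2: a consumer misreads those bytes as ISO-8859-1, producing mojibake.
mojibake = utf8_bytes.decode("iso-8859-1")
print(mojibake)  # Ich würde

# Step 3: the mojibake is saved as UTF-8 again, giving the doubled byte
# sequence visible in the raw post (w\xc3\x83\xc2\xbcrde).
double_encoded = mojibake.encode("utf-8")
print(double_encoded)  # b'Ich w\xc3\x83\xc2\xbcrde'

# Reversal: undo the wrong decode, then decode with the right codec.
repaired = mojibake.encode("iso-8859-1").decode("utf-8")
print(repaired)  # Ich würde
```

The `double_encoded` bytes match the second title in the raw post, which is consistent with the title having been read as ISO-8859-1 somewhere along the way.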
Thank you very much for the hint. Indeed, the UTF-8 meta tag was placed after the title tag.
Now it works!
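For anyone who wants to check their own pages for the same ordering issue, here is a minimal sketch. The function name and the two sample pages are hypothetical, and the regular expressions are a rough heuristic rather than a real HTML parser.

```python
import re


def charset_declared_before_title(html: str) -> bool:
    """Return True if a <meta ... charset ...> tag appears before <title>.

    Heuristic only: uses regular expressions, not a real HTML parser.
    """
    meta = re.search(r"<meta[^>]*charset", html, re.IGNORECASE)
    title = re.search(r"<title", html, re.IGNORECASE)
    if meta is None:
        return False  # no charset declaration found at all
    return title is None or meta.start() < title.start()


# Hypothetical pages: the first mirrors the ordering reported in this thread.
bad_page = '<html><head><title>Ich würde</title><meta charset="utf-8"></head></html>'
good_page = '<html><head><meta charset="utf-8"><title>Ich würde</title></head></html>'

print(charset_declared_before_title(bad_page))   # False
print(charset_declared_before_title(good_page))  # True
```

Placing the charset declaration first in `<head>` is the safe default in any case, since the HTML standard requires it to appear within the first 1024 bytes of the document.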