"Body seems unclear" error when users are typing in Chinese

sebastien · May 30, 2018 02:21

Hi,

It seems that Discourse has some trouble dealing with Chinese characters. Our users cannot submit topics or posts if they write in Chinese. In this case, I can see that it’s a long message, but we still get the “Body seems unclear” message.

Any idea?

sam · May 30, 2018 03:28

I see what is happening here.

We automatically disable this on Chinese forums but your forum appears to be English with a Chinese category.

Just set body min entropy to 0 in site settings.

sebastien · May 30, 2018 06:52

Hmm, correction: setting body min entropy to 0 did not fix the issue. I tried another text in Chinese and I still get the same error, even though body min entropy is set to 0.

Did I miss something?

sebastien · June 25, 2018 01:47

Hi,

Following up on this issue. I’m running some tests with the latest version of Discourse.

Body min entropy is set to 0. Same for Title min entropy.

When trying to create a topic with the body below I get the “Body unclear” error:

【澳門日報5月29日消息】國際會議協會(ICCA)日前發佈《二○一七年國際協會會議市場年度報告》。當中澳門多項評比的排位連續兩年均有上升，其中全球城市排名由一六年的七十二名躍升至第六十五名；亞太區域城市排名升一位至第十六，排名超過瑞士的日內瓦、澳大利亞的布里斯班、阿拉伯聯合酋長國的迪拜、韓國的釜山和濟州等城市

Is there a quick workaround for this? My Chinese users are getting nervous because of this issue.

Thx
Seb

Shogo_Ochiai · July 26, 2018 02:22

I’ve narrowed this issue down. As a new user I can only put a single picture in a post, so here is just one piece of evidence and my conclusion.

Conclusion, for both title and body:

Validated: English capital letters ONLY
Validated: English capital letters AND (Chinese letters OR Japanese letters)
Succeeded: Chinese letters AND Japanese letters
Succeeded: English small letters AND (English capital letters OR Chinese letters OR Japanese letters)
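
These four outcomes are what a simple case-comparison check would produce. Here is a minimal sketch in plain Ruby, assuming the validator rejects text that changes when downcased but stays the same when upcased. The method name `looks_all_caps?` is hypothetical and the code is not taken from the Discourse source.

```ruby
# Hypothetical helper (an assumption, not Discourse code): true when the
# text contains cased letters and every one of them is already uppercase.
def looks_all_caps?(text)
  text != text.downcase && text == text.upcase
end

samples = {
  "HELLO WORLD" => "capitals only",
  "HELLO 你好" => "capitals with Chinese",
  "你好 こんにちは" => "Chinese with Japanese",
  "Hello 你好" => "lowercase with capitals and Chinese",
}

samples.each do |text, label|
  verdict = looks_all_caps?(text) ? "rejected" : "accepted"
  puts "#{label}: #{verdict}"
end
```

Chinese and Japanese characters have no case, so on their own they pass such a check, but they do nothing to offset capital letters mixed in with them.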

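mentalstring · December 3, 2020 11:25

Sorry to bump this, but our forum has run into the same problem. Our forum is mostly English, but some categories use other scripts. Setting body min entropy to 0 did not fix it.

The problem seems to be that typing some Latin characters triggers the all-caps check. Here is an example of a message that triggers “Body seems unclear”: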

我看了一下，我8/15寄往俄罗斯的明信片10/13对方收到了，但是10/27寄的对方还没收到，现在已经36天了（不过同一批寄往不同国家的也没被收到）。
因为我是直接投的邮筒所以也不太清楚是不是寄不过去… 如果你在UCPC微信群里也许可以问下大家？

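Is allow uppercase posts really the only solution? Enabling it is not ideal on a mostly-English forum like ours, but I also understand the frustration of users who type a valid message and hit this error because of the script they write in. Could this be solved by checking the ratio of uppercase characters to the body length?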

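Falco · December 3, 2020 14:57

That is exactly what it does, and in your example that ratio is 100%.

We adjust these settings automatically when the forum’s default language is set to Chinese, but if you mix multiple languages on a single instance you need to adjust the setting manually.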

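mentalstring · December 3, 2020 22:53

What if text that contains any character without case variants (such as Chinese) were automatically not considered all caps? That could be done by matching /\p{Lo}/, here.

This approach needs no special settings tweaks for forums that are mainly zh/ko/ja, and it also works well for forums that mix languages, enforcing the “allow uppercase” rule only when the text consists solely of characters that can be uppercased.

Perhaps similar logic could also optimize the existing all-caps check: if the text matches /\p{Ll}/ (a lowercase letter that has an uppercase variant), it is not all caps.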
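
The two suggestions above could be combined roughly like this. It is only a sketch: the method name `shouting?` is made up, and requiring at least one `\p{Lu}` character at the end is an added assumption so that text without any letters is not flagged.

```ruby
# Hypothetical all-caps check built on Unicode general categories.
# Not the Discourse implementation; a sketch of the idea only.
def shouting?(text)
  # Any caseless letter (CJK ideographs, kana, Hangul...) means "not all caps".
  return false if text.match?(/\p{Lo}/)
  # Any lowercase letter also means "not all caps".
  return false if text.match?(/\p{Ll}/)
  # Assumption: only flag text that actually contains an uppercase letter.
  text.match?(/\p{Lu}/)
end

puts shouting?("HELLO WORLD")
puts shouting?("如果你在UCPC微信群里也许可以问下大家？")
puts shouting?("Hello world")
```

With this ordering, a Chinese sentence containing an acronym such as UCPC is let through by the first test, while a purely Latin all-caps message is still caught.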

Falco · December 3, 2020 23:05

Sounds like a great proposal for a pull request!

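mentalstring · December 3, 2020 23:12

My Ruby skills are close to zero, but since the code is fairly self-contained, I can try to cobble something together.

That said, I see a TODO at the top of that file that seems directly related to this line. Is it just a matter of removing the require statement, or should someone familiar with the code submit this PR?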

mentalstring · July 12, 2024 16:39

I had a go at it in FIX: Allow all caps within CJK text by mentalstring · Pull Request #27900 · discourse/discourse · GitHub.

I’m far from being a Ruby developer, so please bear with me.

zogstrip · July 22, 2024 15:21

Thanks @mentalstring, I used your PR as inspiration to create

github.com/discourse/discourse

FIX: Allow all caps within CJK text

main ← fix-text-sentinel-for-non-latin-locales

opened 03:18PM - 22 Jul 24 UTC

ZogStriP

+19 -36

This improves the `TextSentinel` so that we don't consider CJK text as being uppercase and thus failing the validator.

It also optimizes the entropy computation by using native ruby `.bytes` to get all the bytes from the text.

It also tweaks the `seems_pronounceable?` and `seems_unpretentious?` check to use the `\p{Alnum}` unicode regexp group to account for non-latin languages.

Reference - https://meta.discourse.org/t/body-seems-unclear-error-when-users-are-typing-in-chinese/88715

Inspired by https://github.com/discourse/discourse/pull/27900

That PR also includes some performance improvements and better handling of non-Latin locales.

(cc @lindsey)
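
The PR description quoted above mentions computing entropy from the text's bytes. One plausible reading, and it is an assumption rather than a quote from the PR, is that the entropy score is the number of distinct bytes in the text. The helper name `distinct_byte_count` is hypothetical.

```ruby
# Hypothetical entropy measure (assumption): count the distinct bytes
# in the stripped text.
def distinct_byte_count(text)
  text.strip.bytes.uniq.size
end

puts distinct_byte_count("aaaaaaaa") # a single repeated byte
puts distinct_byte_count("你好")      # six UTF-8 bytes, one of them repeated
```

Under that reading, a minimum entropy of 0 can never reject a post, which fits the reports earlier in the thread that the error persisted with body min entropy set to 0 and had to come from a different check.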

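mentalstring · July 22, 2024 17:59

Glad to see this fixed! We run an international forum, and although English is the main language, we have categories dedicated to other languages, so this has been an annoyance for a long time.

Now that skipped_locale is only used for seems_unpretentious, I wonder whether we could drop “ko” from it, since modern Korean uses spaces? Note that I don’t speak Korean, so you may want to double-check this.

While you’re at it, I think there is one more thing in TextSentinel that could easily be improved, but I didn’t dare try it (again, I’m not a Ruby developer). If you have the time, I think it is fairly simple and could bring a free performance gain.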

github.com/discourse/discourse

lib/text_sentinel.rb

a267c0727


      
          def seems_unpretentious?
            skipped_locales.include?(SiteSetting.default_locale) || @opts[:max_word_length].nil? ||
              @text.scan(/\p{Alnum}+/).map(&:size).max.to_i <= @opts[:max_word_length]
          end

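As I understand it, this checks whether any word exceeds the length limit by splitting the text into words, computing the length of each word, then scanning all the lengths to find the maximum, and only then comparing that maximum with the limit.

Could we skip all of that by trying to match the text against something like /\p{Alnum}{#{max_word_length + 1},}/ (the syntax may be off, but hopefully you get the idea)?

Without knowing how Ruby works internally, I’d expect this to stop as soon as a match is found. When there is no overly long word (the most common case), the text is scanned only once, skipping the splitting, the per-word length checks, and so on.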
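
A rough comparison of the two approaches, written as standalone methods rather than as the actual `TextSentinel` code. Both method names and the `max_word_length` argument are illustrative, and whether the regex version is faster in practice would need benchmarking.

```ruby
# Approach from the snippet quoted above: collect every word,
# then take the longest one.
def within_limit_by_scan?(text, max_word_length)
  text.scan(/\p{Alnum}+/).map(&:size).max.to_i <= max_word_length
end

# Proposed approach: look for a single run of alphanumerics that is
# at least one character longer than the limit.
def within_limit_by_match?(text, max_word_length)
  !text.match?(/\p{Alnum}{#{max_word_length + 1},}/)
end

["short words only", "a supercalifragilistic word", ""].each do |text|
  scan_result = within_limit_by_scan?(text, 10)
  match_result = within_limit_by_match?(text, 10)
  puts "#{text.inspect}: scan=#{scan_result} match=#{match_result}"
end
```

Both methods should agree on every input. The second one builds no intermediate array of words and can return as soon as the regex engine finds one run that is too long.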

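Sorry for hijacking the topic, but since the new PR has already been merged I wasn’t sure where best to post this. It is probably too small to deserve a new topic, yet it seems like an easy improvement. Feel free to carry on.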

zogstrip · July 23, 2024 14:12

I don’t know either, but I’d love to get confirmation from a Korean speaker.

That is a brilliant idea.

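mentalstring · July 23, 2024 15:27

Awesome! Thanks for taking the time.

Perhaps one of the Korean translators (@9bow, @alexkoala, @changukshin) could confirm whether modern Korean uses spaces between words the way Roman/Latin scripts do, so that Discourse could rely on that assumption when checking Korean text for overly long words?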
