“Body seems unclear” error when users are typing in Chinese

sebastien · May 30, 2018, 2:21 AM

Hi,

It seems that Discourse has some trouble dealing with Chinese characters. Our users cannot submit topics or posts if they write in Chinese. In this case, I can see that it’s a long message, but we still get the “Body seems unclear” message.

Any idea?

sam · May 30, 2018, 3:28 AM

I see what is happening here.

We automatically disable this on Chinese forums but your forum appears to be English with a Chinese category.

Just set body min entropy to 0 in site settings.

sebastien · May 30, 2018, 6:52 AM

Hmm, correction: setting body min entropy to 0 did not fix the issue. I tried with another text in Chinese and I still get the same error even though body min entropy is set to 0.

Did I miss something?
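
For context, here is a minimal sketch of what an entropy threshold of this kind checks. It assumes that entropy is counted as the number of distinct bytes in the text (the pull request description later in this thread mentions `.bytes`); the method names are hypothetical and this is not the Discourse source. It illustrates why a minimum of 0 lets every text through this particular check, so an error that persists has to come from a different check.

```ruby
# Hypothetical helper names; a sketch, not the Discourse source.
# Assumption: "entropy" is the number of distinct bytes in the stripped text.
def byte_entropy(text)
  text.to_s.strip.bytes.uniq.size
end

# With min_entropy = 0 every text passes this check, even an empty one.
def enough_entropy?(text, min_entropy)
  byte_entropy(text) >= min_entropy
end

puts byte_entropy("aaaa")                # 1: a single distinct byte
puts enough_entropy?("aaaa", 2)          # false: below the threshold
puts enough_entropy?("國際會議協會", 0)   # true: a threshold of 0 never rejects
```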

sebastien · June 25, 2018, 1:47 AM

Hi,

Following up on this issue. I’m running some tests with the latest version of Discourse.

Body min entropy is set to 0. Same for Title min entropy.

When trying to create a topic with the body below I get the “Body unclear” error:

【澳門日報5月29日消息】國際會議協會(ICCA)日前發佈《二○一七年國際協會會議市場年度報告》。當中澳門多項評比的排位連續兩年均有上升，其中全球城市排名由一六年的七十二名躍升至第六十五名；亞太區域城市排名升一位至第十六，排名超過瑞士的日內瓦、澳大利亞的布里斯班、阿拉伯聯合酋長國的迪拜、韓國的釜山和濟州等城市

Is there a quick workaround for this? My Chinese users are getting nervous because of this issue.

Thx
Seb

Shogo_Ochiai · July 26, 2018, 2:22 AM

I’ve narrowed this issue down. As a new user I can only attach a single picture per post, so here is just one piece of evidence and my conclusion.

Conclusion, for both title and body:

Validation error: uppercase English letters ONLY
Validation error: uppercase English letters AND (Chinese characters OR Japanese characters)
Succeeded: Chinese characters AND Japanese characters
Succeeded: lowercase English letters AND (uppercase English letters OR Chinese characters OR Japanese characters)
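
These four outcomes are what one would expect from an all-caps check based purely on case conversion. The following is a minimal sketch under that assumption (the method name is hypothetical and this is not the Discourse source): a text counts as all caps when upcasing leaves it unchanged but downcasing changes it. CJK characters have no case, so they never count against the "unchanged by upcase" test.

```ruby
# Hypothetical sketch, not the Discourse source.
# A text is treated as "all caps" when upcasing is a no-op but downcasing is not.
def looks_all_caps?(text)
  text == text.upcase && text != text.downcase
end

samples = {
  "HELLO WORLD"    => "uppercase English only",
  "HELLO 你好"      => "uppercase English and Chinese",
  "你好 こんにちは"  => "Chinese and Japanese",
  "Hello 你好"      => "mixed-case English and Chinese",
}

samples.each do |text, label|
  verdict = looks_all_caps?(text) ? "rejected" : "accepted"
  puts "#{label}: #{verdict}"
end
```

Under this sketch the first two samples are rejected and the last two are accepted, which matches the list above.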

mentalstring · December 3, 2020, 11:25 AM

Sorry for reviving this, but we have hit the same issue on our forum, which is primarily in English but has some sections in other scripts. Setting body min entropy to 0 did not fix this.

The issue seems to be that the use of some Latin characters trips the all-caps check. Here’s an example of a message that runs into the “Body seems unclear” notice:

我看了一下，我8/15寄往俄罗斯的明信片10/13对方收到了，但是10/27寄的对方还没收到，现在已经36天了（不过同一批寄往不同国家的也没被收到）。
因为我是直接投的邮筒所以也不太清楚是不是寄不过去… 如果你在UCPC微信群里也许可以问下大家？

Is enabling allow uppercase posts the only solution here? On forums like ours, where English is the main language, enabling that is not ideal, but I can also understand the frustration of users who enter a valid message in their own script and run into that error. Could checking the ratio of caps to the size of the body help here?

Falco · December 3, 2020, 2:57 PM

That is what it does, and in your example the ratio is 100%.

When a forum’s default language is set to Chinese we tweak those settings automatically, but if you have mixed languages in a single instance you need to tweak that setting yourself.

mentalstring · December 3, 2020, 10:53 PM

If the text contains even a single letter that has no upper/lower case variant (as with Chinese), then the text is automatically not all uppercase. This could be checked by matching against /\p{Lo}/ in here.

This approach would not require a special setting tweak for forums primarily in zh/ko/ja, and it would also play well with forums where mixed languages are used, enforcing the uppercase rule only where the text consists solely of characters that have case.

Maybe a similar logic could also be used to optimize the existing check for all caps: if the text matches /\p{Ll}/ (lowercase letter that has an uppercase variant), then the text is not all caps.
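
A minimal sketch of the two ideas above, with a hypothetical method name (this is not the Discourse source). The final rule, that a text with no uppercase letter at all is not all caps, is an added assumption to keep digits-only or punctuation-only text from being flagged.

```ruby
# Hypothetical sketch of the proposal above, not the Discourse source.
def all_caps?(text)
  return false if text.match?(/\p{Lo}/) # a caseless letter (e.g. CJK) is present
  return false if text.match?(/\p{Ll}/) # a lowercase letter is present
  text.match?(/\p{Lu}/)                 # assumption: need at least one uppercase letter
end

puts all_caps?("如果你在UCPC微信群里也许可以问下大家？") # false
puts all_caps?("PLEASE HELP")                          # true
puts all_caps?("Please help")                          # false
puts all_caps?("12345")                                # false
```

Each branch returns at the first matching character, so most texts are settled after a short scan.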

Falco · December 3, 2020, 11:05 PM

Sounds like a good idea for a pull request!

mentalstring · December 3, 2020, 11:12 PM

My Ruby chops are nearly nonexistent, but I can try to put something together, as the change is somewhat contained.

With that said, I’m seeing a TODO at the top of that file which seems related to this precise line of code. Is it as simple as removing the require, or should someone who knows what they are doing take on this PR?

mentalstring · July 12, 2024, 4:39 PM

I gave this a try in FIX: Allow all caps within CJK text by mentalstring · Pull Request #27900 · discourse/discourse · GitHub.

I’m still far from being a Ruby developer, so please bear with me.

zogstrip · July 22, 2024, 3:21 PM

Thanks, @mentalstring. Using your PR as a reference, I created the following PR:

github.com/discourse/discourse

FIX: Allow all caps within CJK text

main ← fix-text-sentinel-for-non-latin-locales

opened 03:18PM - 22 Jul 24 UTC

ZogStriP

+19 -36

This improves the `TextSentinel` so that we don't consider CJK text as being uppercase and thus failing the validator. It also optimizes the entropy computation by using native Ruby `.bytes` to get all the bytes from the text. It also tweaks the `seems_pronounceable?` and `seems_unpretentious?` checks to use the `\p{Alnum}` Unicode regexp group to account for non-Latin languages.

Reference: https://meta.discourse.org/t/body-seems-unclear-error-when-users-are-typing-in-chinese/88715
Inspired by: https://github.com/discourse/discourse/pull/27900

It also includes some performance improvements and better handling of non-Latin locales.

(cc @lindsey)

mentalstring · July 22, 2024, 5:59 PM

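This is great! We run an international forum where English is the main language, but we also have categories dedicated to other languages, and this has been a pain point for years.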

Now that skipped_locales is only used by seems_unpretentious?, I wonder whether “ko” could be left out of it, since modern Korean uses spaces. I don’t speak Korean, so this may need confirming.

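Sorry to take up more of your time, but there is one more thing in TextSentinel that I think could easily be improved. I didn’t dare touch it myself (I’m not a Ruby developer). If you have a moment, I think it would be a fairly simple, essentially free performance gain.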

github.com/discourse/discourse

lib/text_sentinel.rb

a267c0727


      
    def seems_unpretentious?
      skipped_locales.include?(SiteSetting.default_locale) || @opts[:max_word_length].nil? ||
        @text.scan(/\p{Alnum}+/).map(&:size).max.to_i <= @opts[:max_word_length]
    end

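As I understand it, this splits the text into words, computes the length of each word, scans all the lengths to find the maximum, and finally compares that maximum to the limit, all to check whether any word is longer than the limit.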

The syntax is probably wrong, but hopefully the intent comes across: couldn’t all of that be skipped by simply matching the text against something like /\p{Alnum}{#{max_word_length + 1},}/?

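I don’t know Ruby’s internals, but this would most likely stop checking as soon as a match is found. When there is no overly long word (the most common case), the text would be scanned just once, skipping the splitting, the individual word-length checks, and so on.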
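
To make the comparison concrete, here is a minimal sketch of both approaches side by side. The method names are hypothetical and this is not the Discourse source; the first method mirrors the quoted line, the second is the regex idea described above. No performance measurement is claimed here.

```ruby
# Hypothetical helper names; a sketch, not the Discourse source.

# Approach in the quoted code: collect every word, then take the longest.
def longest_word_ok_scan?(text, max_word_length)
  text.scan(/\p{Alnum}+/).map(&:size).max.to_i <= max_word_length
end

# Proposed approach: look for one alphanumeric run longer than the limit.
# match? returns a boolean and does not build an array of words.
def longest_word_ok_regex?(text, max_word_length)
  !text.match?(/\p{Alnum}{#{max_word_length + 1},}/)
end

text = "short words and a loooooooooooooooooong one"
[5, 30].each do |limit|
  scan_result  = longest_word_ok_scan?(text, limit)
  regex_result = longest_word_ok_regex?(text, limit)
  puts "limit #{limit}: scan=#{scan_result} regex=#{regex_result}"
end
```

Both methods should agree for any text and limit, because a run of more than `max_word_length` alphanumeric characters exists exactly when the longest word exceeds the limit.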

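Sorry for hijacking the topic. Since the new PR has already been merged, this may not be worth a topic of its own, but it looks like an easy win and I’m not sure where it would be best to post it. Feel free to take it from here.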

zogstrip · July 23, 2024, 2:12 PM

I have no idea either. It would be great to get confirmation from a Korean speaker.

That’s a great idea.

mentalstring · July 23, 2024, 3:27 PM

Yay! Thank you for your time.

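Could any of the Korean translators (@9bow, @alexkoala, @changukshin) confirm that modern Korean uses spaces between words, much like Roman/Latin scripts do? If so, Discourse could rely on that assumption when processing Korean text to find overly long words.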
