It seems that Discourse has some trouble dealing with Chinese characters. Our users cannot submit topics/posts when they write in Chinese. In this case, I can see that it's a long message, but we still get the "Body seems unclear" message.
Hum. Correction. It seems setting body min entropy to 0 did not fix the issue. I tried with another text in Chinese and I still get the same error, even though body min entropy is set to 0.
Sorry for reviving this, but we have hit the same issue on our forum, which is primarily in English but has some sections in other scripts. Setting body min entropy to 0 did not fix this.
The issue seems to be that the use of some Latin characters trips the all-caps check. Here's an example of a message that bumps into the "Body seems unclear" notice:
Is allow uppercase posts the only solution here? On forums like ours where English is the main language, enabling that is not ideal, but I can also understand the frustration of users entering a valid message in their script and bumping into that error. Could checking the ratio of caps versus the size of the body help here?
That is what it does, and in your example the ratio is 100%.
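To illustrate why the ratio hits 100% here (a simplified sketch, not the actual Discourse code): if the ratio only counts cased letters, CJK characters contribute nothing, so even a handful of Latin capitals dominates.

```ruby
# Sketch only: a caps ratio computed over cased letters alone.
def caps_ratio(text)
  upper = text.scan(/\p{Lu}/).size
  lower = text.scan(/\p{Ll}/).size
  cased = upper + lower
  cased.zero? ? 0.0 : upper.to_f / cased
end

caps_ratio("中文文本 ABC 中文") # => 1.0 — only the three cased letters count
```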
When a forum's default language is set to Chinese we tweak those settings automatically, but if you have mixed languages in a single instance you need to tweak them yourself.
If the text contains even a single letter that has no upper/lower case variant (as with Chinese), then the text is automatically not all uppercase. This could be checked by matching against /\p{Lo}/ in here.
This approach would not require a special settings tweak for forums primarily in zh/ko/ja, and it would also play well with forums where mixed languages are used, only enforcing allow uppercase posts where only uppercase-able characters are used.
Maybe similar logic could also be used to optimize the existing all-caps check: if the text matches /\p{Ll}/ (a lowercase letter that has an uppercase variant), then the text is not all caps.
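Something like this, combining both ideas (a rough sketch with made-up names, not the actual TextSentinel code):

```ruby
# Sketch of the proposed check; method name is illustrative.
def seems_shouting?(text)
  return false if text.match?(/\p{Ll}/) # any lowercase letter => not all caps
  return false if text.match?(/\p{Lo}/) # any caseless letter (CJK etc.) => skip the check
  text.match?(/\p{Lu}/)                 # shouting only if uppercase letters remain
end

seems_shouting?("THIS IS ALL CAPS") # => true
seems_shouting?("Mixed Case text")  # => false
seems_shouting?("中文文本 ABC")      # => false (caseless letters present)
```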
My Ruby chops are nearly nonexistent, but I can try to put something together, as it is somewhat contained.
With that said, I'm seeing a TODO at the top of that file which seems related to this precise line of code. Is it as simple as removing the require, or should someone who knows what they are doing go for this PR?
Great to see this addressed! We run an international forum, and while English is the main language, we have categories dedicated to other languages, so this has been a long-term annoyance.
Now that skipped_locale is only used for seems_unpretentious, I'm wondering if we could also drop 'ko' from it, since modern Korean uses spaces. Mind that I don't speak Korean, so you may want to double-check this.
While I have your attention, there's one more thing that I think could be an easy improvement to TextSentinel but didn't dare touch (again, not a Ruby developer). If you have a moment, I think it's fairly simple and could be a free performance gain.
As I understand it, this checks if a word is longer than the limit by splitting the text into words, calculating the length of each one, scanning all the lengths to find the highest, and only then comparing that with the limit.
Could we perhaps skip all that by just matching the text against something like /\p{Alnum}{#{max_word_length + 1},}/ (syntax likely wrong, but hopefully you get the idea)?
Without knowing the inner workings of Ruby, this should stop the check as soon as there's a match, and when there is no too-long word (the most common case), the text is only scanned once, skipping the splitting, the individual word length checks, etc.
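Something along these lines (a hedged sketch with made-up names and an illustrative limit, not the actual TextSentinel code):

```ruby
MAX_WORD_LENGTH = 30 # illustrative limit, not Discourse's actual setting

# Current approach as I understand it: split, measure every word, take the max.
def word_too_long_via_split?(text)
  text.split(/\s+/).map(&:length).max.to_i > MAX_WORD_LENGTH
end

# Proposed single pass: succeed on the first run of MAX_WORD_LENGTH + 1
# alphanumeric characters and stop scanning right there.
TOO_LONG_WORD = /\p{Alnum}{#{MAX_WORD_LENGTH + 1},}/

def word_too_long_via_regex?(text)
  text.match?(TOO_LONG_WORD)
end
```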
Sorry if I'm hijacking the topic here, but as the new PR is already merged, I'm not sure of the best place to post this; it's perhaps too small to deserve a new topic, but it seems like an easy win. Feel free to run with it.
Maybe one of the Korean translators (/cc @9bow, @alexkoala, @changukshin) can confirm that modern Korean uses spaces between words, similar to Roman/Latin scripts, so that Discourse can rely on that assumption when processing Korean text to find too-long words?