CJKV text has no word boundaries, so it is more reliable to apply this feature at the sentence level. In short, please support this feature without requiring word boundaries.
Suggested: (Chinese) 推荐主题、 危禁词要是支持中文就好了 - 支持 - Discourse中文论坛
1 Like
schungx
(Stephen Chung)
October 3, 2017, 6:13 AM
2
There is a discussion here:
@sam is correct in that \b doesn’t seem to match any Unicode, or any non-ASCII word breaks.
\w seems to be defined narrowly as [A-Za-z0-9_], probably just to parse source-code-like text. And \b is simply (\w\W|\W\w). So using \b has the net effect of treating every character outside the simple ASCII letters and digits as if it were whitespace. There doesn’t seem to be an easy way around this.
An option to deal with this is to omit the \b wrapping altogether – a good idea because this will n…
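To make the behavior described above concrete, here is a small sketch in JavaScript, where \w is ASCII-only and \b is defined in terms of \w. The sample strings and variable names are made up for illustration, and this is not Discourse's code.

```javascript
// \b only fires between a \w character ([A-Za-z0-9_]) and a non-\w character.
const wrappedAscii = /\btest\b/;
const wrappedCjk = /\b测试\b/;
const bareCjk = /测试/;

const results = {
  // ASCII word surrounded by spaces: boundaries exist, so it matches.
  asciiInSentence: wrappedAscii.test("this is a test"), // true
  // Every character is non-\w, so there is no \b anywhere in the string.
  cjkInSentence: wrappedCjk.test("这是一个测试文本"), // false
  // A \b appears only because ASCII letters touch the CJK word.
  cjkBetweenAscii: wrappedCjk.test("a测试b"), // true
  // Without the \b wrapper the CJK word is found.
  bareCjkInSentence: bareCjk.test("这是一个测试文本"), // true
};

console.log(results);
```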
4 Likes
schungx
(Stephen Chung)
October 3, 2017, 6:35 AM
4
If you can do a custom build of discourse, it is a simple matter to change that one line of code to remove the wrapping \b's.
In the long term, I suggest removing them by default, or at least adding a site setting for those of us running non-English forums.
1 Like
pfaffman
(Jay Pfaffman)
October 3, 2017, 1:47 PM
5
You could create a plugin to do that and/or submit a PR.
schungx
(Stephen Chung)
October 3, 2017, 1:50 PM
6
Unfortunately a plugin requires quite a bit of Ruby knowledge. I can debug, but probably not even close to writing plugins.
A PR would require that I fork the entire repo, which is ok except I have no way to test it. It is bad form to submit a PR without testing…
1 Like
Stranik
(Evgeny)
October 3, 2017, 2:51 PM
7
Removing one line really isn’t enough. The logic in that file needs to be completely rewritten. I posted a working version of the file there (using loops).
schungx
(Stephen Chung)
October 4, 2017, 4:17 PM
8
Well, not to remove the line, but to remove the \b's in the line.
A regexp with hard-coded word breaks will never work for all languages. The best you can do is to allow the user to decide which words require word breaks and which do not.
With the \b wrapper hard-coded in right now, there is no choice.
schungx
(Stephen Chung)
January 11, 2018, 5:50 AM
9
This issue is now solved by:
Prelim testing shows that it is working perfectly fine! Good job!
Now finally I can censor Chinese!
To match Chinese patterns, turn on Settings > Posting > watched words regular expressions.
Beware, your Watched Words will now be raw regular expressions, so if your list includes English words, you’ll need to put in your own word break \b where necessary.
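To illustrate the caveat, here is a minimal sketch of a mixed list where each entry is treated as a raw regular expression. The helper names `compileWatchedWords` and `findMatches` and the sample entries are hypothetical; this is not how Discourse compiles the list internally, only a way to see why English entries need their own \b.

```javascript
// Hypothetical helper: join raw patterns into one case-insensitive regexp.
function compileWatchedWords(patterns) {
  return new RegExp(patterns.map((p) => `(?:${p})`).join("|"), "gi");
}

// Hypothetical helper: return every matched substring, or an empty array.
function findMatches(text, patterns) {
  return text.match(compileWatchedWords(patterns)) || [];
}

const text = "Testing a test: 这是一个测试文本";

// English entry carries its own \b, Chinese entry is left bare.
const withBreaks = findMatches(text, ["\\btest\\b", "测试"]);
console.log(withBreaks); // [ 'test', '测试' ]

// Without \b the English entry also hits the start of "Testing".
const withoutBreaks = findMatches(text, ["test", "测试"]);
console.log(withoutBreaks); // [ 'Test', 'test', '测试' ]
```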
6 Likes
jomaxro
(Joshua Rosenfeld)
Closed on
January 12, 2018, 11:00 PM
10
This topic was automatically closed after 40 hours. New replies are no longer allowed.
We should now have proper “word boundary” handling for Chinese thanks to
main ← fix/watched-words-cjk-boundaries
opened 02:44PM - 16 Feb 26 UTC
Watched words failed to match in CJK (Chinese, Japanese, Korean) and other spaceless scripts because word boundary detection relied on whitespace or non-word characters. Languages like Chinese don't use spaces between words, so "测试" inside "这是一个测试文本" was never matched.
Introduce a SPACELESS_SCRIPTS constant covering Han, Hiragana, Katakana, Hangul, Thai, Lao, Myanmar, Khmer, and Tibetan Unicode ranges. Update `match_word_regexp` for both Ruby and JS engines so that characters from these scripts are treated as word boundaries. This allows a CJK watched word to match when surrounded by other CJK characters, and a Latin watched word to match when adjacent to CJK text (e.g., "Test" in "我的Test很好"), while still preventing partial Latin matches (e.g., "Testing" does not match "Test").
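As a rough illustration of the idea (not the code from this PR), a boundary check that treats spaceless-script characters as boundaries might look like the JavaScript sketch below. The helper name `matchWordRegExp`, the `SPACELESS` and `WORD` strings, and the use of lookarounds are assumptions made for this sketch; only the script list comes from the description above.

```javascript
// Script list taken from the description; the real constant may be defined differently.
const SPACELESS =
  "\\p{Script=Han}\\p{Script=Hiragana}\\p{Script=Katakana}\\p{Script=Hangul}" +
  "\\p{Script=Thai}\\p{Script=Lao}\\p{Script=Myanmar}\\p{Script=Khmer}\\p{Script=Tibetan}";

// Assumed definition of a "word character": any letter, digit, or underscore.
const WORD = "\\p{L}\\p{N}_";

function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: a boundary is the string edge, a non-word character,
// or any character from a spaceless script.
function matchWordRegExp(word) {
  const before = `(?<=^|[^${WORD}]|[${SPACELESS}])`;
  const after = `(?=$|[^${WORD}]|[${SPACELESS}])`;
  return new RegExp(before + escapeRegExp(word) + after, "iu");
}

console.log(matchWordRegExp("测试").test("这是一个测试文本")); // true
console.log(matchWordRegExp("Test").test("我的Test很好")); // true
console.log(matchWordRegExp("Test").test("Testing")); // false
```

This sketch only covers the three cases named in the description. It does not try to decide what should happen when a CJK watched word sits directly next to Latin letters.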
Also fix the admin watched word testing modal to use `RegExp.exec()` with capture group extraction instead of `String.match()`, since the new boundary patterns include a leading consuming group.
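For anyone wondering why the switch matters, here is a small sketch under assumed names. With a global regexp, `String.match()` returns only the full matches, which include whatever the leading group consumed, while looping over `RegExp.exec()` gives access to the capture group. The pattern below is a simplified stand-in that only knows about Han characters, not the pattern from this PR.

```javascript
// Simplified stand-in: the leading group consumes one boundary character,
// and group 1 holds the watched word itself.
const regexp =
  /(?:^|[^\p{L}\p{N}_]|\p{Script=Han})(test)(?=$|[^\p{L}\p{N}_]|\p{Script=Han})/giu;
const text = "我的Test很好";

// Full matches include the consumed leading character.
const viaMatch = text.match(regexp) || [];
console.log(viaMatch); // [ '的Test' ]

// exec() exposes the capture group, so only the word is collected.
const viaExec = [];
regexp.lastIndex = 0;
let m;
while ((m = regexp.exec(text)) !== null) {
  viaExec.push(m[1]);
}
console.log(viaExec); // [ 'Test' ]
```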
Remove the outdated "non-chrome browsers do not support lookbehind" comment — all major browsers have supported lookbehind since 2023.
https://meta.discourse.org/t/71288
https://meta.discourse.org/t/396109