Censor words should support sentence level censoring for Chinese

fantasticfears · October 3, 2017, 5:41 AM

CJKV text doesn’t have word boundaries, so it’s more reliable to apply this feature at the sentence level. In short, please support censoring without requiring word boundaries.

Suggested: (Chinese) https://meta.discoursecn.org/t/topic/2175?u=fantasticfears

schungx · October 3, 2017, 6:13 AM

There is a discussion here:

schungx · October 3, 2017, 6:35 AM

If you can do a custom build of Discourse, it is a simple matter to change that one line of code to remove the wrapping \b’s.

In the long term, I suggest removing them by default, or at least adding a site setting for those of us running non-English forums.
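To illustrate why the wrapping \b’s matter, here is a minimal JavaScript sketch. It is not Discourse’s actual code: `censor`, `escapeRegExp`, and the `wordBoundaries` option are hypothetical names invented for this example. It only shows how a JavaScript regex behaves with and without the \b wrapper.

```javascript
// Hypothetical helpers for illustration only; not taken from Discourse.
function escapeRegExp(text) {
  // Escape regex metacharacters so the word is matched literally.
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function censor(text, word, { wordBoundaries }) {
  const escaped = escapeRegExp(word);
  // Optionally wrap the word in \b, as the hard-coded behaviour does.
  const source = wordBoundaries ? `\\b${escaped}\\b` : escaped;
  return text.replace(new RegExp(source, "gi"), (match) =>
    "■".repeat(match.length)
  );
}

// \b only fires between an ASCII word character and a non-word character,
// so it never fires between two Chinese characters.
console.log(censor("这是一个测试文本", "测试", { wordBoundaries: true }));
// 这是一个测试文本
console.log(censor("这是一个测试文本", "测试", { wordBoundaries: false }));
// 这是一个■■文本

// For English the wrapper is what prevents partial matches.
console.log(censor("testing", "test", { wordBoundaries: true }));  // testing
console.log(censor("testing", "test", { wordBoundaries: false })); // ■■■■ing
```

The last two lines show the trade-off: dropping the \b’s fixes Chinese but makes English words match inside longer words, which is why a per-site or per-word choice is being asked for.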

pfaffman · October 3, 2017, 1:47 PM

You could create a plugin to do that and/or submit a PR.

schungx · October 3, 2017, 1:50 PM

Unfortunately, a plugin requires quite a bit of Ruby knowledge. I can debug, but I’m probably not even close to being able to write plugins.

A PR would require that I fork the entire repo, which is ok except I have no way to test it. It is bad form to submit a PR without testing…

Stranik · October 3, 2017, 2:51 PM

Removing one line really isn’t enough. The logic file needs to be completely rewritten. I posted a working version of the file there (using loops).

schungx · October 4, 2017, 4:17 PM

Well, not to remove the line, but to remove the \b’s in the line.

A regexp with word breaks will never work for all languages. The best you can do is to allow the user to decide which words require word breaks and which do not.

With the \b wrapper hard-coded in right now, there is no choice.

schungx · January 11, 2018, 5:50 AM

This issue is now solved by:

To match Chinese patterns, turn on Settings > Posting > watched words regular expressions.

Beware, your Watched Words will now be raw regular expressions, so if your list includes English words, you’ll need to put in your own word break \b where necessary.
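As a rough illustration of what “raw regular expressions” means in practice, the JavaScript sketch below treats each list entry as a regex source. The list contents and the `findWatched` helper are hypothetical and are not how Discourse implements the setting; the point is only that the Chinese entry needs no \b while the English entry supplies its own.

```javascript
// Hypothetical watched-word list: every entry is a raw regex source.
const watchedPatterns = [
  "测试",        // Chinese: no word break wanted
  "\\btest\\b", // English: word breaks added by hand
];

// Hypothetical helper: returns the patterns that match the given text.
function findWatched(text, patterns) {
  return patterns.filter((source) => new RegExp(source, "i").test(text));
}

console.log(findWatched("这是一个测试文本", watchedPatterns)); // [ '测试' ]
console.log(findWatched("run the test now", watchedPatterns)); // [ '\\btest\\b' ]
console.log(findWatched("testing in progress", watchedPatterns)); // []
```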

jomaxro · January 12, 2018, 11:00 PM

This topic was automatically closed after 40 hours. New replies are no longer allowed.

zogstrip · February 16, 2026, 2:45 PM

We should now have proper handling of “word boundaries” for Chinese thanks to

github.com/discourse/discourse

FIX: support CJK and spaceless scripts in watched word boundaries (#37844)

main ← fix/watched-words-cjk-boundaries

opened 02:44PM - 16 Feb 26 UTC

ZogStriP

+84 -9

Watched words failed to match in CJK (Chinese, Japanese, Korean) and other spaceless scripts because word boundary detection relied on whitespace or non-word characters. Languages like Chinese don't use spaces between words, so "测试" inside "这是一个测试文本" was never matched.

Introduce a SPACELESS_SCRIPTS constant covering Han, Hiragana, Katakana, Hangul, Thai, Lao, Myanmar, Khmer, and Tibetan Unicode ranges. Update `match_word_regexp` for both Ruby and JS engines so that characters from these scripts are treated as word boundaries. This allows a CJK watched word to match when surrounded by other CJK characters, and a Latin watched word to match when adjacent to CJK text (e.g., "Test" in "我的Test很好"), while still preventing partial Latin matches (e.g., "Testing" does not match "Test").

Also fix the admin watched word testing modal to use `RegExp.exec()` with capture group extraction instead of `String.match()`, since the new boundary patterns include a leading consuming group.

Remove the outdated "non-chrome browsers do not support lookbehind" comment; all major browsers have supported lookbehind since 2023.

https://meta.discourse.org/t/71288
https://meta.discourse.org/t/396109
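The idea in this description can be sketched in JavaScript as follows. This is an approximation under stated assumptions, not the code from the PR: the real `SPACELESS_SCRIPTS` is described as a set of Unicode ranges, whereas this sketch uses Unicode script property escapes, and `matchWatchedWord`, `escapeRegExp`, and the `spacelessAsBoundary` option are hypothetical names. Word characters are assumed here to be Unicode letters, digits, and underscore.

```javascript
// Assumption: script property escapes stand in for the PR's Unicode ranges.
const SPACELESS_SCRIPTS = [
  "Han", "Hiragana", "Katakana", "Hangul",
  "Thai", "Lao", "Myanmar", "Khmer", "Tibetan",
].map((script) => `\\p{Script=${script}}`).join("");

function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: returns the matched word, or null if there is no match.
function matchWatchedWord(word, text, { spacelessAsBoundary = true } = {}) {
  // A boundary is any character that is not a letter, digit, or underscore...
  let boundary = "[^\\p{L}\\p{N}_]";
  // ...and, optionally, any character from a spaceless script.
  if (spacelessAsBoundary) {
    boundary += `|[${SPACELESS_SCRIPTS}]`;
  }
  const regex = new RegExp(
    `(?:^|${boundary})(${escapeRegExp(word)})(?=$|${boundary})`,
    "iu"
  );
  // The leading group consumes the boundary character, so the full match
  // (result[0]) may include it; the word itself is capture group 1.
  const result = regex.exec(text);
  return result ? result[1] : null;
}

console.log(matchWatchedWord("测试", "这是一个测试文本")); // 测试
console.log(matchWatchedWord("测试", "这是一个测试文本", { spacelessAsBoundary: false })); // null
console.log(matchWatchedWord("Test", "我的Test很好")); // Test
console.log(matchWatchedWord("Test", "Testing is fun")); // null
```

Reading the capture group instead of the full match mirrors the reason the description gives for switching the testing modal to `RegExp.exec()`.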

Related topics

Topic	Replies	Views	Activity
Censored words do not respect word boundaries in non-latin alphabet Bug	8	1561	November 29, 2018
Censored pattern Bug	7	2289	January 11, 2018
* wildcards in Watched Words (Censor) don't work Feature	19	3262	January 11, 2018
Hope Watched words adds support for non-English characters Bug	1	86	February 16, 2026
A closing round bracket breaks word censoring Bug	5	1471	September 13, 2017