Censored pattern

schungx · August 15, 2017, 12:25pm

@sam is correct in that \b doesn’t seem to match any Unicode, or any non-ASCII word breaks.

\w seems to be defined narrowly as [A-Za-z0-9_], probably just to parse source-code type texts. And \b is simply the position between a \w and a \W in either order, i.e. (\w\W|\W\w). So using \b has the net effect of treating any character outside simple ASCII letters, digits and underscore as if it were white-space. There doesn’t seem to be an easy way to deal with this.
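To make the effect concrete, here is a small sketch in Python. It is only an imitation of the behaviour described above: Python's own \w is Unicode-aware by default, so the re.ASCII flag is used here (my assumption) to get the narrow [A-Za-z0-9_] definition. The sample words are made up for illustration.

```python
import re

# Assumption: re.ASCII restricts \w to [A-Za-z0-9_], imitating the narrow
# definition described above. Python's default for str patterns is Unicode-aware.
wrapped = re.compile(r"\bcafé\b", re.ASCII)

# "é" counts as a non-word character, so there is no boundary between it and
# the following space: the standalone word is missed.
missed = wrapped.search("a café here")

# Inside "caféteria" the "é" is followed by the word character "t", which
# does create a boundary: the pattern matches in the middle of a longer word.
false_hit = wrapped.search("caféteria")

# Here every character is a non-word character, so no position in the text
# is a boundary and the wrapped pattern cannot match.
cjk = re.search(r"\b坏话\b", "他说了坏话", re.ASCII)

print(missed, false_hit, cjk)
```

So the wrapping both misses the real word and hits a fragment of a longer one, which is the opposite of what a word filter wants.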

One option is to omit the \b wrapping altogether. That is a good idea because the wrapping will not work for any language other than English, which is quite restrictive if you ask me. Not the entire world speaks English…

Put a warning on the regexp filter setting that users must manually wrap their regexps in \b if they are dealing strictly with English.

This has multiple benefits:

For English – anyone who is capable of entering a regexp string should know how to put in a pair of \b's
For W. European languages – i.e. the extended Latin set, they can put \b around all the words that contain ASCII endings, and do more precise filtering on words with non-ASCII endings/beginnings.
For CJK languages – do nothing, simply search character-for-character. CJK languages are mostly not written with white-space between words, so there is no point in searching for a word based on surrounding white-space that won’t be there. In fact, for Chinese, Japanese and Chinese characters in Korean, words are not separated from one another; they stick together to form a single stream and there is nothing to break them apart other than context.
For other languages – e.g. Arabic etc. you are no worse off without the \b wrapping than with it. In fact, with the \b wrapping, the user can do nothing. Without it, the user can still do some filtering.
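A rough sketch of what the proposal would look like in practice, again in Python with re.ASCII standing in for the narrow \w. The censor function, its mask character and the sample patterns are all hypothetical and are not taken from the actual filter code; the only point is that patterns are used exactly as entered, and \b is something the user adds when it makes sense.

```python
import re
from typing import Sequence

def censor(text: str, patterns: Sequence[str], mask: str = "■") -> str:
    # Hypothetical helper: each pattern is applied verbatim, with no
    # automatic \b wrapping. Every match is replaced by mask characters
    # of the same length.
    for pattern in patterns:
        text = re.sub(
            pattern,
            lambda m: mask * len(m.group(0)),
            text,
            flags=re.ASCII,  # assumption: imitate the ASCII-only \w and \b
        )
    return text

# English: the user adds \b themselves, so "class" is left alone.
english = censor("a class of ass", [r"\bass\b"])

# CJK: no wrapping, plain character-for-character search inside running text.
chinese = censor("他说了坏话然后走了", ["坏话"])

# The same CJK pattern with forced \b wrapping filters nothing.
wrapped = censor("他说了坏话然后走了", [r"\b坏话\b"])

print(english, chinese, wrapped)
```

With this arrangement the English user loses nothing, and everyone else gains at least a plain substring filter.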
