Censored pattern

@sam is correct in that \b doesn’t seem to match any Unicode, or any non-ASCII word breaks.

\w seems to be defined narrowly as [A-Za-z0-9_], probably because it was meant for parsing source-code-like text. And \b is simply the zero-width position between a \w and a \W (roughly \w\W|\W\w, plus the start and end of the string). So using \b has the net effect of treating every character outside the simple ASCII letters/digits as if it were white-space. There doesn’t seem to be an easy way around this.
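
To make the effect concrete, here is a minimal sketch. It assumes Python's `re` module with the `re.ASCII` flag as a stand-in for the engine discussed here, since that flag narrows \w to [A-Za-z0-9_]; the helper name `matches_wrapped` is made up for illustration.

```python
import re

def matches_wrapped(word: str, text: str) -> bool:
    """Search for `word` wrapped in \\b, the way an automatic wrapper would.

    re.ASCII is an assumption: it restricts \\w to [A-Za-z0-9_] so that
    Python behaves like the narrow definition described above.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.ASCII)
    return pattern.search(text) is not None

# Plain English: the boundaries sit where expected.
print(matches_wrapped("bad", "a bad word"))     # True

# The word starts and ends with a non-ASCII letter, so no \b exists
# next to it and the wrapped pattern can never match.
print(matches_wrapped("été", "un été chaud"))   # False

# CJK text contains no \w characters at all, hence no boundaries.
print(matches_wrapped("坏", "这是坏话"))          # False

# The accented letter counts as a non-word character, so a fragment
# of a longer word is wrongly reported as a whole word.
print(matches_wrapped("caf", "un café"))        # True
```

The last case is the "treated as white-space" effect: the é in "café" acts as a word break, so "caf" looks like a complete word.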

One option is to omit the \b wrapping altogether. I think that is a good idea, because the wrapping will not work for any language other than English, which is quite restrictive if you ask me. Not the entire world speaks English…

Then put a warning on the regexp filter setting saying that users must manually wrap their regexps in \b if they are dealing strictly with English.

This has multiple benefits:

  1. For English – anyone who is capable of entering a regexp string should know how to put in a pair of \b's

  2. For W. European languages – i.e. the extended Latin set, they can put \b around all the words that begin and end with ASCII letters, and do more precise filtering on words with non-ASCII endings/beginnings.

  3. For CJK languages – do nothing, simply search character-for-character. CJK languages are mostly not written with strict white-spacing between words, so there is no point in artificially searching for a word based on the white-space around it, because that white-space won’t be there. White-space is not used to delimit words. In fact, for Chinese, Japanese and Chinese characters in Korean, words are not separated from one another; they stick together to form a single stream and there is nothing to break them apart other than context.

  4. For other languages – e.g. Arabic etc. you are no worse off without the \b wrapping than with it. In fact, with the \b wrapping, the user can do nothing. Without it, the user can still do some filtering.
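
The cases above can be sketched as a filter that uses the pattern exactly as entered and leaves \b to the user. This is only an illustration under assumptions: Python's `re` with `re.ASCII` again stands in for the engine with the narrow \w, and the `censor` function and its "***" replacement are hypothetical, not the actual filter code.

```python
import re

def censor(text: str, user_pattern: str, replacement: str = "***") -> str:
    """Replace every match of `user_pattern`, used as-is.

    No \\b is added automatically; whoever writes the pattern decides
    whether word boundaries make sense for their language.
    re.ASCII is an assumption that mimics \\w = [A-Za-z0-9_].
    """
    return re.sub(user_pattern, replacement, text, flags=re.ASCII)

# 1. English: the user adds \b by hand, so "badge" is left alone.
print(censor("a bad badge", r"\bbad\b"))     # a *** badge

# 2. Extended Latin, word with ASCII first and last letters: \b still works.
print(censor("très naïve !", r"\bnaïve\b"))  # très *** !

# 2. Extended Latin, word with non-ASCII ends: leave the \b out.
print(censor("un été chaud", "été"))         # un *** chaud

# 3. CJK: plain character-for-character search.
print(censor("这是坏话", "坏"))                # 这是***话
```

Without a boundary the pattern also matches inside longer words, which is the trade-off the user accepts by omitting \b.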
