@sam is correct that \b doesn’t seem to recognize any word breaks involving Unicode (non-ASCII) characters.
\w seems to be defined narrowly as [A-Za-z0-9_], probably just to parse source-code-like text. And \b simply matches the position between a \w and a \W, in either order (\w\W or \W\w). So using \b has the net effect of treating any character outside the simple ASCII letters/digits as a non-word character, much like white-space. There doesn’t seem to be an easy way around this.
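
To make the effect concrete, here is a minimal sketch in Python. It is not the filter's actual code: the `re.ASCII` flag is used only to imitate an engine where \w is [A-Za-z0-9_], and `ascii_word_match` is a hypothetical helper name.

```python
import re

def ascii_word_match(word, text):
    # Hypothetical helper: wrap the word in \b, as an automatic wrapper would.
    # re.ASCII restricts \w (and therefore \b) to [A-Za-z0-9_], which is an
    # assumption standing in for the narrow definition described above.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.ASCII)
    return pattern.search(text) is not None

# Plain ASCII word: the boundaries behave as expected.
print(ascii_word_match("cat", "a cat sat"))      # True

# "é" counts as a non-word character, so there is no boundary between
# the space and "é": the standalone word is NOT found.
print(ascii_word_match("été", "un été chaud"))   # False

# Worse, a boundary IS seen between "x" and "é", so the word is
# "found" in the middle of a longer string.
print(ascii_word_match("été", "xétéx"))          # True
```

So the wrapping not only misses non-ASCII words, it can also match them in exactly the places a word-boundary check is supposed to exclude.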
One option is to omit the \b wrapping altogether. That seems like a good idea, because the wrapping will not work for any language other than English, which is quite restrictive if you ask me. Not the entire world speaks English…
Put a warning on the regexp filter setting that users must manually wrap their regexps in \b if they are dealing strictly with English.
This has multiple benefits:
- For English: anyone who is capable of entering a regexp string should know how to put in a pair of \b’s.
- For W. European languages, i.e. the extended Latin set: they can put \b around all the words that have ASCII endings, and do more precise filtering on words with non-ASCII endings/beginnings.
- For CJK languages: do nothing, simply search character-for-character. CJK languages are mostly not written with strict white-space between words, so there is no point in artificially searching for a word based on surrounding white-space, because that white-space won’t be there. White-space is not used to delimit words. In fact, for Chinese, Japanese and Chinese characters in Korean, words are not separated from one another; they stick together to form a single stream and there is nothing to break them apart other than context.
- For other languages, e.g. Arabic: you are no worse off than with the \b wrapping. In fact, with the \b wrapping, the user can do nothing. Without it, the user can still do some filtering.
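
A rough sketch of what this could look like from the user's side, again in Python with `re.ASCII` standing in for the narrow \w definition. `build_filter` is a hypothetical name and the patterns are made-up examples; the point is only that the pattern is compiled exactly as entered, with no \b added automatically.

```python
import re

def build_filter(user_pattern):
    # Hypothetical helper: compile the pattern as the user typed it,
    # without wrapping it in \b. re.ASCII is an assumption that imitates
    # an engine where \w is [A-Za-z0-9_].
    return re.compile(user_pattern, re.ASCII)

# English: the user adds the \b pair themselves.
english = build_filter(r"\bspam\b")
print(bool(english.search("no spam here")))   # True
print(bool(english.search("spammer")))        # False

# W. European: \b only on the side that is an ASCII letter.
french = build_filter(r"\bcafé")
print(bool(french.search("un café noir")))    # True

# CJK: no \b at all, plain character-for-character search.
cjk = build_filter("垃圾")
print(bool(cjk.search("这是垃圾邮件")))         # True

# The same CJK pattern with forced \b wrapping finds nothing.
wrapped = build_filter(r"\b垃圾\b")
print(bool(wrapped.search("这是垃圾邮件")))     # False
```

The last two cases show the trade-off: with forced wrapping the CJK user gets no matches at all, while without it each user can choose where a boundary check makes sense.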