While converting a big list of watched words to regular expressions I found some ways to circumvent the filters. These appear to work for both “normal” watched words as well as regular expressions.
Double spaces: if your watched word is forbidden word then this can be circumvented by placing multiple spaces in between the two words. Fun fact: the cooked post has the double space removed, so the trick is totally invisible in the final text.
to prevent this using regular expressions: use forbidden\s*word
to prevent this without regular expressions: I did not find a solution.
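A quick sketch of the fix above (in Ruby, since that is what Discourse runs on; the post text and phrase are just examples): the literal phrase misses the double-spaced variant, while \s* swallows any run of whitespace.

```ruby
# Example phrase "forbidden word" used as the watched word.
plain   = /forbidden word/    # what a literal watched-word match amounts to
relaxed = /forbidden\s*word/  # the workaround suggested above

post = "this is a forbidden  word example"  # note the double space

puts post.match?(plain)    # false: a literal single space can't match two
puts post.match?(relaxed)  # true: \s* matches zero or more whitespace chars
```

Note that \s* also matches zero spaces, so the relaxed pattern additionally catches "forbiddenword"; use \s+ if you only want to catch whitespace tricks.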
Use underscores to work around word boundaries:
without regexes: if you surround a watched word with underscores then it will be printed in italics and it will be allowed. So _forbidden_ will be accepted if your filter is forbidden.
with regexes: normally word boundaries are only checked if you use \b, and then the underscore will beat them. So _forbidden_ will be accepted if your filter is \bforbidden\b.
to prevent this using regular expressions: use [\b\_] instead of \b. EDIT: this does not seem to work well.
Removing the word boundaries might work as well but then you might risk accidentally disallowing words like cumulative and title
to prevent this without regular expressions: I did not find a solution.
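The \b failure described above can be seen in a couple of lines of Ruby (example word and text are mine): underscore counts as a word character, so there is no boundary between the "_" and the "f", and the filter never fires.

```ruby
filter = /\bforbidden\b/

puts "say forbidden here".match?(filter)    # true: boundaries on both sides
puts "say _forbidden_ here".match?(filter)  # false: "_" is a word character,
                                            # so \b finds no transition there
```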
Right, this is generally the kind of thing we don’t spend time fighting because there are a lot of “clever” tricks to circumvent any kind of word blocklist. Unicode is a big, big space.
Just to let you know, [] is for “character classes”. In Perl regular expressions, and possibly Ruby ones, \b is a “word boundary” outside of a character class and “backspace” inside of a character class. In C, ‘\b’ is always backspace (<control-H> to be precise). Backspace is not a useful character most of the time and word boundaries are, hence the redefinition.
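This double meaning of \b is easy to demonstrate in Ruby (my own throwaway strings): inside a character class it matches the backspace character (0x08), which is why the earlier [\b\_] attempt behaves strangely.

```ruby
puts "\b".match?(/[\b]/)     # true: inside [], \b is the backspace char
puts "a b".match?(/a[\b]b/)  # false: the class wants backspace, not a space
```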
To use a RE to catch “_forbidden_” or “forbidden” I’d probably use:
\b_?forbidden_?\b
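A quick check of that pattern in Ruby (test strings are mine): it catches both the bare word and the italic _forbidden_ trick, while the outer \b anchors still keep it from firing inside longer words.

```ruby
filter = /\b_?forbidden_?\b/

puts "a forbidden word".match?(filter)    # true: plain word still caught
puts "a _forbidden_ word".match?(filter)  # true: optional underscores matched
puts "unforbiddenish".match?(filter)      # false: no boundary inside a word
```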
(I also know how to &#xXX;-encode all my letters, which gets past both the Unicode tricks and the regular expression.)