Watched words tricks

While converting a big list of watched words to regular expressions I found some ways to circumvent the filters. These appear to work for both “normal” watched words and regular expressions.

Double spaces: if your watched word is forbidden word, it can be circumvented by placing multiple spaces between the two words. Fun fact: the cooked post will have the double space removed, so the trick is totally invisible in the final text.

  • to prevent this using regular expressions: use forbidden\s*word
  • to prevent this without regular expressions: I did not find a solution.
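
The double-space trick can be reproduced outside the forum with a quick sketch. This uses Python's `re` module purely for illustration (the forum's own regex engine may differ in details), and the sample text and variable names are made up for the example. Note that `\s*` also matches zero whitespace, so it would catch "forbiddenword" as well; `\s+` is the stricter choice if that is not wanted.

```python
import re

# Hypothetical post text with TWO spaces between the words.
raw = "this is a forbidden  word here"

literal = re.compile(r"forbidden word")      # plain watched word, single space
flexible = re.compile(r"forbidden\s*word")   # zero or more whitespace characters
strict = re.compile(r"forbidden\s+word")     # one or more whitespace characters

literal_hit = literal.search(raw) is not None     # False: the double space slips through
flexible_hit = flexible.search(raw) is not None   # True: \s* absorbs both spaces

# \s* also matches when there is no space at all; \s+ does not.
flexible_joined = flexible.search("forbiddenword") is not None  # True
strict_joined = strict.search("forbiddenword") is not None      # False
strict_hit = strict.search(raw) is not None                     # True
```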

Use underscores to work around word boundaries:
without regexes: if you surround a watched word with underscores, it will be printed in italics and it will be allowed. So _forbidden_ will be accepted if your filter is forbidden.
with regexes: word boundaries are normally only checked if you use \b, and the underscore defeats them because it counts as a word character, so there is no boundary between _ and f. So _forbidden_ will be accepted if your filter is \bforbidden\b.

  • to prevent this using regular expressions: use [\b\_] instead of \b
    EDIT this does not seem to work well.
    Removing the word boundaries might work as well, but then you risk accidentally disallowing words like cumulative and title :wink:
  • to prevent this without regular expressions: I did not find a solution.
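
The underscore behaviour with `\b` can be checked in isolation. The sketch below uses Python's `re` module as a stand-in (an assumption; the forum's engine may differ), with made-up sample strings. It also shows the trade-off of dropping the boundaries: the pattern then matches inside longer words.

```python
import re

bounded = re.compile(r"\bforbidden\b")   # watched word with word boundaries
unbounded = re.compile(r"forbidden")     # same word, boundaries removed

# Plain usage is caught by the bounded pattern.
plain_hit = bounded.search("a forbidden thing") is not None        # True

# "_" is a word character, so there is no boundary between "_" and "f".
italic_hit = bounded.search("a _forbidden_ thing") is not None     # False

# Without boundaries the italic form is caught...
italic_unbounded = unbounded.search("a _forbidden_ thing") is not None  # True

# ...but so is any longer word that merely contains the watched word.
false_positive = unbounded.search("an unforbidden thing") is not None   # True
```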
6 Likes

Right, this is generally the kind of thing we don’t spend time fighting because there are a lot of “clever” tricks to circumvent any kind of word blocklist. Unicode is a big, big space.

4 Likes

Indeed, we tried doing this on a bunch of huge education projects a while back.

Before it was abandoned they went live with fuzzy matching, which predictably caused all kinds of problems for legitimate use cases.

3 Likes

Watched words are mostly a “first line of defense” against the bad words. You still need the community to flag the workarounds and violations.

No regex you ever devise will be able to detect an image.

9 Likes

Just to let you know, [] is for “character classes”. In Perl regular expressions, and possibly Ruby ones, \b is a “word boundary” outside of a character class and “backspace” inside of a character class. In C, ‘\b’ is always backspace (<control-H> to be precise). Backspace is not a useful character most of the time and word boundaries are, hence the redefinition.
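
A small sketch of that difference, using Python's `re` module, which follows the same convention (backspace inside a character class, word boundary outside). The sample strings are only for illustration.

```python
import re

# Inside [...], \b is the backspace character (0x08), not a word boundary.
in_class = re.compile(r"[\b\_]")

matches_backspace = in_class.search("\x08") is not None    # True: literal backspace
matches_underscore = in_class.search("_") is not None      # True: literal underscore

# "a b" has word boundaries but no backspace or underscore, so the class finds nothing.
matches_boundary = in_class.search("a b") is not None      # False

# Outside a class, \b is a zero-width word boundary and does match in "a b".
boundary_outside = re.search(r"\b", "a b") is not None     # True
```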

To use a RE to catch “_forbidden_” or “forbidden” I’d probably use:

\b_?forbidden_?\b
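
A quick check of this pattern, again with Python's `re` module as a stand-in and a hypothetical helper named `is_blocked`. It catches both the plain and the single-underscore forms without matching inside longer words, but a doubled underscore (Markdown bold) still slips past, which fits the point made earlier in the thread about there always being another trick.

```python
import re

# Optional single underscore on each side, with word boundaries outside of it.
pattern = re.compile(r"\b_?forbidden_?\b")

def is_blocked(text: str) -> bool:
    """Return True if the (hypothetical) filter would flag the text."""
    return pattern.search(text) is not None

plain = is_blocked("a forbidden thing")        # True
italic = is_blocked("a _forbidden_ thing")     # True
longer = is_blocked("an unforbidden thing")    # False: no boundary before "f"
bold = is_blocked("a __forbidden__ thing")     # False: only one "_" is allowed per side
```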

(I also know how to &#xXX; encode all my letters to avoid Unicode tricks or the regular expression.)

3 Likes

I had never realized that there was a difference depending on the context. Thank you for explaining! :slight_smile:

2 Likes