Watched words tricks

RGJ · June 5, 2020, 8:34am

While converting a big list of watched words to regular expressions, I found some ways to circumvent the filters. These appear to work for both “normal” watched words and regular expressions.

Double spaces: if your watched word is forbidden word, then it can be circumvented by placing multiple spaces between the two words. Fun fact: the cooked post will have the double space removed, so the trick is totally invisible in the final text.

to prevent this using regular expressions: use forbidden\s*word
to prevent this without regular expressions: I did not find a solution.
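
A minimal sketch of the double-space trick, written with Python's re module purely for illustration (Discourse's own matching engine may behave differently). The names literal, flexible and text are hypothetical and not part of any Discourse API.

```python
import re

# Hypothetical filter patterns for the two-word phrase used in the post.
literal = re.compile(r"forbidden word")       # exactly one space
flexible = re.compile(r"forbidden\s*word")    # any run of whitespace, including none

text = "this is a forbidden  word"  # two spaces between the words

print(bool(literal.search(text)))   # False: the double space slips past
print(bool(flexible.search(text)))  # True: \s* absorbs the extra space
```

Note that \s* also matches zero whitespace, so forbiddenword would be caught too; \s+ would require at least one whitespace character.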

Use underscores to work around word boundaries:
without regexes: if you surround a watched word with underscores then it will be printed in italics and it will be allowed. So _forbidden_ will be accepted if your filter is forbidden.
with regexes: normally word boundaries are only checked if you use \b, and then the underscore will beat them, because an underscore counts as a word character and so there is no boundary between _ and f. So _forbidden_ will be accepted if your filter is \bforbidden\b.

to prevent this using regular expressions: use [\b\_] instead of \b
EDIT: this does not seem to work well.
Removing the word boundaries might work as well, but then you risk accidentally disallowing words like cumulative and title
to prevent this without regular expressions: I did not find a solution.
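
The underscore behaviour can be reproduced in a few lines. This is a sketch in Python's re module, assumed here only as a stand-in for whatever engine the forum uses; strict and loose are hypothetical names.

```python
import re

strict = re.compile(r"\bforbidden\b")

# "_" is a word character (\w), so there is no word boundary
# between "_" and "f" and the strict pattern does not match.
print(bool(strict.search("a forbidden b")))    # True
print(bool(strict.search("a _forbidden_ b")))  # False

# Dropping the boundaries catches the underscore variant,
# but it also matches inside longer words.
loose = re.compile(r"forbidden")
print(bool(loose.search("a _forbidden_ b")))   # True
print(bool(loose.search("unforbidden")))       # True
```

The last line shows the trade-off mentioned above: without boundaries, innocent words that merely contain the watched word are flagged as well.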

codinghorror · June 5, 2020, 9:39pm

Right, this is generally the kind of thing we don’t spend time fighting because there are a lot of “clever” tricks to circumvent any kind of word blocklist. Unicode is a big, big space.

Stephen · June 5, 2020, 9:57pm

Indeed, we tried doing this on a bunch of huge education projects a while back.

Before it was abandoned, they went live with fuzzy matching, which predictably caused all kinds of problems for legitimate use cases.

riking · June 5, 2020, 11:22pm

Watched words are mostly a “first line of defense” against the bad words. You still need the community to flag the workarounds and violations.

No regex you ever devise will be able to detect a .

elijah · June 6, 2020, 6:52am

Just to let you know, [] is for “character classes”. In Perl regular expressions, and possibly Ruby ones, \b is a “word boundary” outside of a character class and “backspace” inside of a character class. In C, ‘\b’ is always backspace (<control-H> to be precise). Backspace is not a useful character most of the time and word boundaries are, hence the redefinition.

To use a RE to catch “_forbidden_” or “forbidden” I’d probably use:

\b_?forbidden_?\b
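
As a rough check of both points, here is a sketch using Python's re module, which also treats \b inside a character class as backspace. Ruby's behaviour is not verified here, and the names char_class, pattern and results are hypothetical.

```python
import re

# Inside a character class, \b means backspace (\x08), not a word boundary,
# so [\b_] matches a backspace or an underscore and nothing else.
char_class = re.compile(r"[\b_]")
print(bool(char_class.fullmatch("\x08")))  # True
print(bool(char_class.fullmatch("_")))     # True
print(bool(char_class.fullmatch(" ")))     # False

# The suggested pattern: optional underscores kept inside the word boundaries.
pattern = re.compile(r"\b_?forbidden_?\b")
samples = ["forbidden", "_forbidden_", "unforbidden", "__forbidden__"]
results = {s: bool(pattern.search(s)) for s in samples}
print(results)
```

In this sketch the first two samples match and the last two do not: the pattern allows only one underscore on each side, so a doubled underscore still gets through.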

(I also know how to &#xXX; encode all my letters to avoid Unicode tricks or the regular expression.)

RGJ · June 6, 2020, 7:27am

I had never realized that there was a difference depending on the context. Thank you for explaining!
