* wildcards in Watched Words (Censor) don't work


(Stephen Chung) #1

Repro:

  1. Make sure Site Settings > watched words regular expressions is OFF
  2. Go to Logs > Watched Words
  3. Type in somebadword* in Censor
  4. ./launcher restart app (this step appears to be necessary when the watched words list is changed, although I can’t understand why)
  5. Try to type in somebadword in the compose window. No effect.
  6. Check admin/watched_words.json and confirm that the word is accurately entered as somebadword*

This code should be replacing * with \S* (but I’m not sure if it will get lower-cased in the end). Somehow it is not matching.
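For illustration, here is a minimal sketch of the wildcard expansion described above, written in Python rather than taken from the Discourse source. The helper names `wildcard_to_regex` and `censor`, the case-insensitive flag, the `\b` wrapping, and the `■` mask character are all assumptions made for the example.

```python
import re

def wildcard_to_regex(word: str) -> re.Pattern:
    """Hypothetical helper: turn a watched word with * wildcards into a regex."""
    # Escape each literal piece so characters like '.' or '+' stay literal,
    # then join the pieces with \S* where the * wildcards were.
    pieces = [re.escape(piece) for piece in word.split("*")]
    body = r"\S*".join(pieces)
    # Assumption: matching ignores case and is wrapped in word boundaries.
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

def censor(text: str, word: str, mask: str = "■") -> str:
    """Hypothetical helper: replace every match with mask characters."""
    pattern = wildcard_to_regex(word)
    return pattern.sub(lambda m: mask * len(m.group(0)), text)

print(censor("a somebadword here", "somebadword*"))
print(censor("SomeBadWordish stuff", "somebadword*"))
```

Because the match ignores case in this sketch, lower-casing the stored word would not by itself stop the pattern from matching.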


(Stephen Chung) #2

Incidentally:

  1. Site Settings > watched words regular expressions is ON
  2. Go to Logs > Watched Words
  3. Type in somebadword\w* in Censor
  4. Confirm that admin/watched_words.json comes back with the correct somebadword\w*
  5. Try in compose window, no effect.

EDIT: It is Censor that is not working.


(Neil Lalonde) #3

Works on my machine…

Restarting the app is not necessary after changing the watched words list.

Are you doing this as a moderator or admin? If so, that explains why you aren’t seeing anything.


(Jeff Atwood) #4

Yes staff are immune to this by design. So not a bug.


(Stephen Chung) #5

Nope. Already tried it with a normal account. Everything that is wild-carded won’t get masked out.

Incidentally, if a word is included in the censored pattern setting (note: this is in Site Settings, not in Watched Words), it gets masked even when I’m an admin. But that is beyond the scope of this question.

I’m running v1.9.0.beta17 +78. Should I be trying with the latest?


(Stephen Chung) #6

Site setup:

Normal user:

This site is running Beta15. I’ll upgrade to latest and report back.

EDIT: Sorry for not being clear, it is Censor that is not working for me. Not Block or Flag (haven’t tried those).

EDIT 2: OK, Upgraded to latest. Checked. Block and Flag both work fine. Only Censor is not working. I’ve updated the topic title.


#7

Hello!

I’m also having trouble using Watched Words to keep my community from violating the TOS in topics and private messages.

Like @schungx, censored pattern from Site Settings is working great, but I’d rather use Watched Words with Required Approval to prevent users from trying to sneak around the regex I’m using.

However, I was only able to trigger the flagging system when creating topics and writing replies. Private messages just won’t trigger anything (Approval, Censor, Flag or Block).
This was tested on v2.0.0.beta1 +26 with 2 test accounts.


(Jeff Atwood) #8

Can you repro this @neil?


(Neil Lalonde) #9

I can only confirm that censoring a word like somebadword* doesn’t work.

Off topic, but word watching works in pm’s for me.

@thethirdpudding Please open a new topic in #support to explain what you’re doing.


(Neil Lalonde) #11

I fixed this today. Wildcards should work in censored words now.


(Stephen Chung) #12

I’m sorry, but not so fast.

Watched Words Censor now works for words with wildcards.

However, it doesn’t work if watched_words_regular_expression is true.

I don’t think the censor function even considers it as regular expression at all.

Repro:

  1. Settings > watched words regular expression ==> ON
  2. Add xyz* to Censor
  3. In compose window, type xyz123
  4. See that it is censored, because the * is being treated as a wildcard. If it were treated as a regular expression, xyz* would only match xy followed by zero or more z characters
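The difference between the two readings of xyz* can be checked with a short Python sketch. The wildcard form `\bxyz\S*\b` is an assumption about how the wildcard is expanded, not a quote from the Discourse code.

```python
import re

text = "xyz123 xy xyzzz"

# Read as a regular expression: "xy" followed by zero or more "z".
as_regex = re.compile(r"xyz*")
# Read as a wildcard (assumed expansion): "xyz" plus any run of non-space characters.
as_wildcard = re.compile(r"\bxyz\S*\b")

regex_hits = as_regex.findall(text)
wildcard_hits = as_wildcard.findall(text)

print(regex_hits)
print(wildcard_hits)
```

Only the wildcard reading covers the whole of xyz123, which is the behaviour seen in step 4.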

EDIT: The below is the culprit…

Notice that it is not even considering that the pattern may already be a regular expression.

Also,

This is always assuming that the pattern is a word pattern and \b pairs are auto-wrapped onto it. If the pattern is a regular expression, obviously the \b pairs can be omitted because the user should put them in himself.


(Tom Newsom) #13

Wildcards in censor lists can be problematic…



(Stephen Chung) #14

Well, if you put in *shit* then obviously that is what you’ll get… Since you are explicitly asking the system to filter out anything containing these words.

Usually you’ll be using shit* for example…

But of course, it won’t be 100% fool-proof if you use any wildcard. For example:

I love shitaki mushrooms!
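A minimal Python sketch of this false positive, assuming the trailing wildcard expands to `\S*` inside word boundaries and that matches are masked with `■` (both are assumptions for the example):

```python
import re

# Assumed expansion of the watched word "shit*".
pattern = re.compile(r"\bshit\S*\b", re.IGNORECASE)

def mask(text: str) -> str:
    """Hypothetical helper: replace every match with mask characters."""
    return pattern.sub(lambda m: "■" * len(m.group(0)), text)

innocent = mask("I love shitaki mushrooms!")
print(innocent)
```

The innocent word is masked along with the intended targets, so a trailing wildcard trades precision for coverage.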


(Neil Lalonde) #15

@eviltrout is this setting meant to force every word to explicitly include the word boundaries in the patterns? This feature was added for a specific customer, so removing the \b around the patterns could have… surprising consequences!

Agreed. Censor wasn’t updated to support watched_words_regular_expression, so I’ll need to implement it.


(Robin Ward) #16

Yes this was intentional. If you write the regular expression yourself you can control whether it’s on a boundary or not. Some of the watched words we imported were not for example! It’s a power feature.


(Neil Lalonde) #17

@schungx It should work now. Plz update and try again.


(Stephen Chung) #18

Yup. It makes sense to wrap with \b for regular mode since it is simpler and makes sense (at least for English). A small pitfall is that it screws up on non-ASCII letters, but that’s a small issue comparatively speaking.

When a site turns on regular expressions, you assume that the admin knows what he/she is doing and writes correct regexps. Then those \b pairs become an unnecessary limitation.
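A small Python sketch of why the automatic \b pairs get in the way for a language written without spaces. The sample sentence and word are made up for the example, and Python's `re` module stands in for the engine Discourse actually uses, so treat it as an illustration of the idea only.

```python
import re

text = "这是坏词吗"  # no spaces between words, as is normal in Chinese

# Regular mode as described in this topic: the word is wrapped in \b ... \b.
wrapped = re.compile(r"\b坏词\b")
# Regular expression mode: the admin's pattern is used without added boundaries.
unwrapped = re.compile(r"坏词")

wrapped_match = wrapped.search(text)
unwrapped_match = unwrapped.search(text)

print(wrapped_match)
print(unwrapped_match.group(0))
```

The wrapped pattern finds nothing because there is no word boundary between adjacent Chinese characters, while the unwrapped pattern matches the word in the middle of the sentence.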


(Mittineague) #19

I’m thinking it might be better not to deal with POSIX regex at all and to limit it to PostgreSQL wildcards (_ %)

IMHO, assuming that an Admin that wants regex will know regex will in most cases be quite a leap. Even devs that have advanced programming skills in general can have problems getting regex right.


(Stephen Chung) #20

Not quite.

First of all, I believe the markdown processing in Discourse is actually done via JavaScript so it is natural to use JS regex.

Secondly, there are tons of online tools to check regex’s.

Thirdly, common regex’s are not difficult. The difficult ones are trying to make regex do what it wasn’t meant to do. Most of the normal scenarios are actually quite simple.


(Stephen Chung) #21

Prelim testing shows that it is working perfectly fine! Good job!

Now finally I can censor Chinese! :tada:

