* wildcards in Watched Words (Censor) don't work

Repro:

  1. Make sure Site Settings > watched words regular expressions is OFF
  2. Go to Logs > Watched Words
  3. Type in somebadword* in Censor
  4. ./launcher restart app (this step appears to be necessary when the watched words list is changed, although I can’t understand why)
  5. Try to type in somebadword in the compose window. No effect.
  6. Check admin/watched_words.json and confirm that the word is accurately entered as somebadword*

https://github.com/discourse/discourse/blob/37854299488e47a7eac818e577c65e9431501b46/app/services/word_watcher.rb#L28-L35

This code should be replacing * with \S* (but I’m not sure if it will get lower-cased in the end). Somehow it is not matching.
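To make the expected behaviour concrete, here is a minimal sketch of that conversion. It is written in JavaScript rather than the Ruby of `word_watcher.rb`, and `wildcardToRegExp` is a hypothetical helper, not Discourse's actual code. It assumes the word is escaped first, each `*` then becomes `\S*`, and matching is whole-word and case-insensitive.

```javascript
// Hypothetical helper: turn a watched word with * wildcards into a RegExp.
function wildcardToRegExp(word) {
  // Escape every regex metacharacter, including "*".
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Turn each escaped "*" into "zero or more non-space characters".
  const source = escaped.replace(/\\\*/g, "\\S*");
  // Assumption: whole-word, case-insensitive matching.
  return new RegExp(`\\b${source}\\b`, "i");
}

const re = wildcardToRegExp("somebadword*");
console.log(re.source);                          // \bsomebadword\S*\b
console.log(re.test("this is SomeBadWordXYZ"));  // true
```

If the generated pattern looks like this and still fails to match, the problem is probably somewhere other than the wildcard substitution itself.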

Incidentally:

  1. Site Settings > watched words regular expressions is ON
  2. Go to Logs > Watched Words
  3. Type in somebadword\w* in Censor
  4. Confirm that admin/watched_words.json comes back with the correct somebadword\w*
  5. Try in compose window, no effect.

EDIT: It is Censor that is not working.

Works on my machine…

It’s not necessary.

Are you doing this as a moderator or admin? If so, that explains why you aren’t seeing anything.

Yes staff are immune to this by design. So not a bug.

Nope. Already tried it with a normal account. Everything that is wild-carded won’t get masked out.

Incidentally, if a word is included in censored pattern (note: this is in Settings, not in Watched Words), then it gets masked even when I’m admin. But this is beyond this question.

I’m running v1.9.0.beta17 +78. Should I be trying with the latest?

Site setup:

Normal user:

This site is running Beta15. I’ll upgrade to latest and report back.

EDIT: Sorry for not being clear, it is Censor that is not working for me. Not Block or Flag (haven’t tried those).

EDIT 2: OK, Upgraded to latest. Checked. Block and Flag both work fine. Only Censor is not working. I’ve updated the topic title.

Hello!

I’m also having trouble using Watched Words to keep my community from violating the TOS in topics and private messages.

Like @schungx, censored pattern from Site Settings is working great, but I’d rather use Watched Words with Required Approval to prevent users from trying to sneak around the regex I’m using.

However, I was only able to trigger the flagging system when creating topics and writing replies. Private messages just won’t trigger anything (Approval, Censor, Flag or Block).
This was tested on v2.0.0.beta1 +26 with 2 test accounts.

Can you repro this @neil?

I can only confirm that censoring a word like somebadword* doesn’t work.

Off topic, but word watching works in pm’s for me.

@thethirdpudding Please open a new topic in #support to explain what you’re doing.

I fixed this today. Wildcards should work in censored words now.

I’m sorry, but not so fast.

Watched Words Censor now works for words with wildcards.

However, it doesn’t work if watched_words_regular_expression is true.

I don’t think the censor function treats it as a regular expression at all.

Repro:

  1. Settings > watched words regular expression ==> ON
  2. Add xyz* to Censor
  3. In compose window, type xyz123
  4. See that it is censored, because the * is being treated as a wildcard. If it were treated as a regular expression, xyz* would only match xy followed by zero or more z characters.
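
The difference between the two readings of `xyz*` can be checked directly in a JavaScript console. This is only an illustration of the regex semantics, not of how Discourse builds its patterns:

```javascript
const input = "xyz123";

// Read as a regular expression, "xyz*" means "xy" followed by zero or more "z".
const asRegex = /xyz*/;
console.log(input.match(asRegex)[0]);     // "xyz" (the "123" is not part of the match)
console.log("xyzzzz".match(asRegex)[0]);  // "xyzzzz"

// Read as a wildcard, "*" stands for any run of non-space characters.
const asWildcard = /xyz\S*/;
console.log(input.match(asWildcard)[0]);  // "xyz123"
```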

EDIT: The below is the culprit…

https://github.com/discourse/discourse/blob/ad62f1099cdb8782d20ca1296ea73467bc35fce7/app/assets/javascripts/pretty-text/censored-words.js.es6#L11-L13

Notice that it is not even considering that the pattern may already be a regular expression.

Also,

https://github.com/discourse/discourse/blob/ad62f1099cdb8782d20ca1296ea73467bc35fce7/app/assets/javascripts/pretty-text/censored-words.js.es6#L23

This always assumes that the pattern is a plain word and automatically wraps it in a pair of \b anchors. If the pattern is a regular expression, the \b pair should obviously be omitted, because the user can add the anchors himself.
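
A rough sketch of the behaviour I would expect, assuming a boolean that mirrors the regular expression site setting. `buildCensorRegExp` and `censor` are hypothetical names, not the functions in `censored-words.js.es6`, and replacing matches with `■` is also an assumption made for the example:

```javascript
// Hypothetical helper, not the code in censored-words.js.es6.
function buildCensorRegExp(words, regexMode) {
  const sources = words.map((word) => {
    if (regexMode) {
      // Regex mode: use the admin's pattern untouched, with or without \b.
      return word;
    }
    // Plain mode: escape metacharacters, expand "*" and add word boundaries.
    const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    return `\\b${escaped.replace(/\\\*/g, "\\S*")}\\b`;
  });
  return new RegExp(`(?:${sources.join("|")})`, "gi");
}

// Replace each match with a run of filled squares of the same length.
function censor(text, words, regexMode) {
  const re = buildCensorRegExp(words, regexMode);
  return text.replace(re, (match) => "■".repeat(match.length));
}

console.log(censor("xyz123 and xyzzz", ["xyz*"], false));
// ■■■■■■ and ■■■■■
console.log(censor("xyz123 and xyzzz", ["\\bxyz*\\b"], true));
// xyz123 and ■■■■■
```

In regex mode the admin decides whether boundaries apply, so the second call leaves `xyz123` alone while the wildcard reading censors it.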

Wildcards in censor lists can be problematic…

Well, if you put in *shit* then obviously that is what you’ll get… Since you are explicitly asking the system to filter out anything containing these words.

Usually you’ll be using shit* for example…

But of course, it won’t be 100% fool-proof if you use any wildcard. For example:

I love shitaki mushrooms!

@eviltrout is this setting meant to force every word to explicitly include the word boundaries in the patterns? This feature was added for a specific customer, so removing the \b around the patterns could have… surprising consequences!

Agreed. Censor wasn’t updated to support watched_words_regular_expression, so I’ll need to implement it.

Yes this was intentional. If you write the regular expression yourself you can control whether it’s on a boundary or not. Some of the watched words we imported were not for example! It’s a power feature.

@schungx It should work now. Plz update and try again.

Yup. It makes sense to wrap with \b for regular mode since it is simpler and makes sense (at least for English). A small pitfall is that it screws up on non-ASCII letters, but that’s a small issue comparatively speaking.

When a site turns on regular expressions, you assume that the admin knows what he/she is doing and writes correct regexps. In that case the \b pair would be an unnecessary limitation.
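
The non-ASCII pitfall is easy to reproduce. In JavaScript, `\b` is defined in terms of `\w`, which only covers `[A-Za-z0-9_]`, so a pattern wrapped in `\b` never matches inside text that has no ASCII word characters. The sample strings below are made up for the demonstration:

```javascript
// \b marks a position between a \w character and a non-\w character.
const english = /\bbad\b/;
const chinese = /\b坏话\b/;

console.log(english.test("a bad word"));  // true
console.log(chinese.test("这是坏话"));     // false: no ASCII word characters, so no \b position exists
console.log(/坏话/.test("这是坏话"));      // true once the \b pair is dropped
```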

I’m thinking it might be better to not deal with POSIX regex at all and limit it to PostgreSQL wildcards (_ %)

IMHO, assuming that an Admin that wants regex will know regex will in most cases be quite a leap. Even devs that have advanced programming skills in general can have problems getting regex right.

Not quite.

First of all, I believe the markdown processing in Discourse is actually done via JavaScript so it is natural to use JS regex.

Secondly, there are tons of online tools to check regex’s.

Thirdly, common regex’s are not difficult. The difficult ones are trying to make regex do what it wasn’t meant to do. Most of the normal scenarios are actually quite simple.

Prelim testing shows that it is working perfectly fine! Good job!

Now finally I can censor Chinese! :tada:
