Censored words do not respect word boundaries in non-latin alphabet

meglio · November 27, 2018, 11:56pm

A censored word: ебля (f…ck)

A word in post text: употреблять (To Consume)

Stephen · November 27, 2018, 11:57pm

Does that apply to English stuff too? Would Scunthorpe be censored?

Edit: nope, doesn’t appear to happen with English.

Update: it does happen if you tick the box to treat watched words as regular expressions under Admin->Settings->Posting. That’s to be expected, right?

david · November 28, 2018, 12:15am

A word will be censored based on this logic:

non-word character + censored word (which could include non-word characters) + non-word character

‘word character’ is currently defined by the regex \w metacharacter. Unfortunately this is simply “a-z A-Z 0-9 and _”, so any non-Latin letter counts as a non-word character.
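To make the effect concrete, here is a minimal sketch of the rule described above. It is an approximation for illustration only, not the actual Discourse implementation; the helper names (`escapeForRegex`, `censorAscii`) and the ■ replacement character are assumptions.

```javascript
// Hypothetical helper: escape regex metacharacters in a literal word.
function escapeForRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Approximates: (non-word char or start) + censored word + (non-word char or end).
// The right-hand side is a lookahead so the trailing character is not consumed.
function censorAscii(text, word) {
  const re = new RegExp("(\\W|^)(" + escapeForRegex(word) + ")(?=\\W|$)", "gi");
  // Keep the left boundary character, mask the word itself.
  return text.replace(re, (match, left, hit) => left + "■".repeat(hit.length));
}

// English: the word is only censored when it stands alone.
console.log(censorAscii("a badword here", "badword"));
console.log(censorAscii("notabadwordatall", "badword"));

// Cyrillic: every letter is \W, so the surrounding letters count as boundaries
// and the word from the first post is censored inside a longer word.
console.log(censorAscii("употреблять", "ебля"));
```

Because \w covers only ASCII, the Cyrillic letters on either side of the match satisfy `\W`, which is exactly the behaviour reported in the first post.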

As @Stephen points out, you can toggle the “watched words as regular expressions” setting, and then define your own regular expression however you want. It is very tricky for us to have a single regular expression which works perfectly for word boundaries across all languages.

meglio · November 28, 2018, 12:38am

Yet it somehow worked until now; I only noticed the censoring after updating Discourse to the latest version. Did something change in the settings that altered the default behaviour?

danekhollas · November 28, 2018, 1:39am

You might want to look at how it is done in the autolinkify theme, where I was dealing with the exact same problem (essentially listing non-word chars by hand, precisely because \w does not handle non-Latin alphabets).

meglio · November 28, 2018, 2:02am

Possible solutions for word-boundaries in non-latin characters:

Approach 1 - use sophisticated regex.

A few of them are listed here:

Approach 2 - Unicode Word Boundaries js library

http://unicode.org/reports/tr29/tr29-9.html#Word_Boundaries

https://github.com/wikimedia/unicodejs

Will any of that work?
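As one possible shape for Approach 1, the sketch below uses Unicode property escapes and lookarounds instead of \w. It assumes a JavaScript engine with ES2018 regex support (the `u` flag, `\p{…}`, and lookbehind), which older browsers lack; the helper names are hypothetical and this is not Discourse code.

```javascript
// Hypothetical helper: escape regex metacharacters in a literal word.
function escapeRegExpLiteral(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Treat any Unicode letter, any Unicode digit, or underscore as a word character.
const UNICODE_WORD_CHAR = "[\\p{L}\\p{N}_]";

function censorUnicode(text, word) {
  const re = new RegExp(
    "(?<!" + UNICODE_WORD_CHAR + ")" +   // not preceded by a word character
      escapeRegExpLiteral(word) +
      "(?!" + UNICODE_WORD_CHAR + ")",   // not followed by a word character
    "giu"
  );
  // Array.from counts code points, so characters outside the BMP are masked once.
  return text.replace(re, (hit) => "■".repeat(Array.from(hit).length));
}

console.log(censorUnicode("употреблять", "ебля")); // longer word is left alone
console.log(censorUnicode("(ебля)", "ебля"));      // standalone word is masked
```

This does not implement the full UAX #29 word-boundary rules (combining marks, for instance, are not treated as word characters here), so it is a cheaper approximation rather than a replacement for Approach 2.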

david · November 28, 2018, 12:00pm

I changed the behaviour slightly back in October:

https://github.com/discourse/discourse/commit/3c2608d41c3e19c3037571b9102f73b743053fbc

If we censor ‘badword’, the regex you screenshotted wouldn’t be able to deal with something like parentheses:

(badword)

If there is a way to improve the behaviour without introducing another dependency, that would be great. The approach @danekhollas took in autolinkify is

  let leftWordBoundary = "(\\s|[\\([{]|^)";
  let rightWordBoundary = "([:.;,!?…\\]})]|\\s|$)";

I’m not a big fan of listing characters manually, but if it works, it works. I’ll put it on my list to test this approach. If anyone else fancies giving it a try in the meantime, the change would be to roughly the same places I changed in this commit. The important thing is that pretty-text-test.js.es6 continues to pass.

danekhollas · November 28, 2018, 12:37pm

Small note: I intentionally left single and double quotes out of those character lists; they should be included for this use case.
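For anyone trying this out, here is a rough sketch of how the two boundary strings quoted above might be combined with a censored word, with single and double quotes added as suggested. The function names are hypothetical, and wrapping the right boundary in a lookahead is an assumption made here so that two censored words separated by a single space both match.

```javascript
// Boundary lists from autolinkify, extended with ' and " (an addition for this sketch).
const leftWordBoundary = "(\\s|[\\([{'\"]|^)";
const rightWordBoundary = "([:.;,!?…\\]})'\"]|\\s|$)";

// Hypothetical helper: escape regex metacharacters in a literal word.
function escapeLiteral(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function censorWithListedBoundaries(text, word) {
  const re = new RegExp(
    leftWordBoundary +
      "(" + escapeLiteral(word) + ")" +
      "(?=" + rightWordBoundary + ")", // lookahead: do not consume the right boundary
    "gi"
  );
  // Re-emit the left boundary character and mask only the word.
  return text.replace(re, (match, left, hit) => left + "■".repeat(hit.length));
}

console.log(censorWithListedBoundaries("(badword)", "badword"));
console.log(censorWithListedBoundaries("badword badword", "badword"));
console.log(censorWithListedBoundaries("употреблять", "ебля"));
```

The trade-off is the one already mentioned: anything not in the lists (a hyphen, for example) is not treated as a boundary, so the lists would need to grow as gaps are found.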

meglio · November 29, 2018, 4:39am

[offtopic] I really appreciate everyone’s involvement and prompt replies in a discussion of such a minor issue. I could not express my appreciation by simply liking posts.
