Censored words do not respect word boundaries in non-Latin alphabets

A censored word: ебля (f…ck)

A word in post text: употреблять (To Consume)

Does that apply to English stuff too? Would Scunthorpe be censored?

Edit: nope, doesn’t appear to happen with English:

[screenshot]

Update: it does happen if you tick the box to treat watched words as regular expressions under Admin->Settings->Posting. That’s to be expected, right?

A word will be censored based on this logic:

non-word character + censored word (which could include non-word characters) + non-word character

A ‘word character’ is currently defined by the regex \w metacharacter. Unfortunately this is simply “a-z, A-Z, 0-9 and _”, so letters from any other alphabet are treated as non-word characters.
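To illustrate why this bites Cyrillic but not English, here is a minimal sketch of the logic described above. The helper `naiveCensorRegex` is hypothetical and is not Discourse's actual implementation; it only wraps a word in `\W`-based boundaries.

```javascript
// Hypothetical helper: wraps a censored word in \W-based boundaries,
// mirroring the logic described above (not Discourse's actual code).
function naiveCensorRegex(word) {
  // Escape regex metacharacters in the word itself.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("(\\W|^)" + escaped + "(\\W|$)", "i");
}

// Cyrillic letters are not matched by \w, so they count as "non-word"
// characters and the inner substring looks like a standalone word.
const cyrillicHit = naiveCensorRegex("ебля").test("употреблять");

// Latin letters are matched by \w, so a substring inside a longer
// English word is left alone.
const latinHit = naiveCensorRegex("thorp").test("Scunthorpe");

console.log(cyrillicHit, latinHit); // true false
```

The surrounding Cyrillic letters satisfy `\W`, which is why the substring inside употреблять gets flagged.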

As @Stephen points out, you can toggle the “watched words as regular expressions” setting, and then define your own regular expression however you want. It is very tricky for us to have a single regular expression which works perfectly for word boundaries across all languages.
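As one possible shape for such a custom expression, the sketch below uses Unicode property escapes and lookarounds. It assumes an ES2018-capable JavaScript engine; the helper name `unicodeCensorRegex` is made up for illustration, and whether the watched-words setting accepts this exact pattern has not been verified here.

```javascript
// Hypothetical helper, not part of Discourse: treats any Unicode letter,
// digit, or underscore as a word character.
function unicodeCensorRegex(word) {
  // Escape regex metacharacters in the word itself.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const wordChar = "[\\p{L}\\p{N}_]";
  // Lookarounds assert the boundary without consuming the neighbours.
  // The "u" flag is required for \p{...} property escapes.
  return new RegExp(`(?<!${wordChar})${escaped}(?!${wordChar})`, "iu");
}

const re = unicodeCensorRegex("ебля");
const insideWord = re.test("употреблять"); // surrounded by Cyrillic letters
const standalone = re.test("(ебля)");      // surrounded by punctuation

console.log(insideWord, standalone); // false true
```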

Yet it worked somehow before now, and I just noticed censoring in action after updating Discourse to the latest. Did something change in the settings that changed the default behaviour?

You might want to look at how it is done in the autolinkify theme, where I was dealing with the exact same problem (essentially listing non-word chars by hand, exactly because \w does not handle non-Latin alphabets).

Possible solutions for word-boundaries in non-latin characters:

Approach 1 - use sophisticated regex.

A few of them are listed here:

Approach 2 - Unicode Word Boundaries js library

http://unicode.org/reports/tr29/tr29-9.html#Word_Boundaries

https://github.com/wikimedia/unicodejs


Will any of that work?

I changed the behaviour slightly back in October

https://github.com/discourse/discourse/commit/3c2608d41c3e19c3037571b9102f73b743053fbc

If we censor ‘badword’, the regex you screenshotted wouldn’t be able to deal with something like parentheses:

(badword)

If there is a way to improve the behaviour without introducing another dependency, that would be great. The approach @danekhollas took in autolinkify is

  let leftWordBoundary = "(\\s|[\\([{]|^)";
  let rightWordBoundary = "([:.;,!?…\\]})]|\\s|$)";

I’m not a big fan of listing characters manually, but if it works it works :man_shrugging:. I’ll put it on my list to test this approach. If anyone else fancies giving it a try in the meantime, the change would be to roughly the same places I changed in this commit. The important thing is that pretty-text-test.js.es6 continues to pass.
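For anyone who wants to try it, here is a rough sketch of how those two boundary strings could be combined with a censored word. The helper `boundaryRegex` is hypothetical; it is neither the autolinkify code nor the Discourse change itself, only the two strings are taken from the post.

```javascript
// Boundary lists as quoted above; note they contain no quote characters.
const leftWordBoundary = "(\\s|[\\([{]|^)";
const rightWordBoundary = "([:.;,!?…\\]})]|\\s|$)";

// Hypothetical helper: escapes the word and wraps it in the boundaries.
function boundaryRegex(word) {
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(leftWordBoundary + escaped + rightWordBoundary, "i");
}

const results = {
  // Brackets are listed explicitly, so this case is caught.
  parenthesised: boundaryRegex("badword").test("(badword)"),
  // A Cyrillic letter is not in either list, so no boundary is found.
  insideCyrillicWord: boundaryRegex("ебля").test("употреблять"),
  // Quotes are not in the lists as posted, so a quoted word is missed.
  quoted: boundaryRegex("badword").test('"badword"'),
};

console.log(results);
```

The trade-off is visible in the last case: anything not listed by hand is not a boundary, so the lists need to cover every punctuation mark that matters for the use case.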

Small note: I intentionally left single and double quotes out of those character lists; they should be included for this use case.

[offtopic] I really appreciate everyone’s involvement and prompt replies in a discussion of such a minor issue. I could not express my appreciation by simply liking posts :slight_smile:
