Censored pattern

Stranik · July 27, 2017, 9:40am

1. Where can I find the censored words (to delete them, etc.)?
2. Russian characters do not work.

(v1.9.0.beta4 +317 1.9.0.beta4)

sam · July 28, 2017, 3:21pm

This does not look like a regression, I think it never worked:

This regex does not work with Cyrillic.

Anyone have any ideas on how to rewrite:

censorRegexp = new RegExp("(\\b(?:(hello)|(Здравствуйте))\\b)(?![^\\(]*\\))", "ig");

So that it does not use \b, which does not support Russian characters.

(Be sure to reply here with a tested regex; I have already seen the posts on Stack Overflow.)
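One possible direction, as a sketch only: replace \b with explicit lookarounds built on Unicode property escapes. This assumes an engine with ES2018 lookbehind and \p{...} support under the u flag, which was not widely available when this thread was written. The names buildCensorRegexp and escapeLiteral are hypothetical and not part of Discourse; the trailing parenthesis lookahead is copied from the regex above.

```javascript
// Escape only regex syntax characters, so the result stays valid under the "u" flag.
function escapeLiteral(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a censor regex that treats any Unicode
// letter, digit, or underscore as a "word character" instead of relying on \b.
function buildCensorRegexp(words) {
  const alternation = words.map(escapeLiteral).join("|");
  const wordChar = "[\\p{L}\\p{N}_]";
  return new RegExp(
    `(?<!${wordChar})(?:${alternation})(?!${wordChar})(?![^\\(]*\\))`,
    "giu"
  );
}

const censorRegexp = buildCensorRegexp(["hello", "Здравствуйте"]);
console.log("Hello, Здравствуйте!".replace(censorRegexp, "■")); // ■, ■!
console.log("Здравствуйтесь".replace(censorRegexp, "■"));       // Здравствуйтесь (unchanged)
```

The lookbehind and lookahead only assert that no letter or digit is adjacent, so they behave like \b for ASCII words while also working for Cyrillic. Scripts without spaces between words would still need a different approach.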

gerhard · July 28, 2017, 5:01pm

That’s a tough one.
I don’t think this can be solved with a regex.

http://unicode.org/reports/tr29/tr29-9.html#Word_Boundaries
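For comparison, a non-regex sketch that leans on the segmentation rules from that report: Intl.Segmenter with word granularity splits text into word-like and non-word segments. This assumes a runtime that ships Intl.Segmenter (recent browsers and Node.js); censorBySegments and its parameters are hypothetical names, and the approach only handles single-word entries.

```javascript
// Sketch: censor whole words by walking Unicode word segments.
// Assumes Intl.Segmenter is available in the runtime.
function censorBySegments(text, censoredWords, replacement = "■", locale = "ru") {
  const blocked = new Set(censoredWords.map((w) => w.toLowerCase()));
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let out = "";
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // Only word-like segments are compared; punctuation and spaces pass through.
    out += isWordLike && blocked.has(segment.toLowerCase()) ? replacement : segment;
  }
  return out;
}

console.log(censorBySegments("Скажи Здравствуйте, мир", ["здравствуйте"]));
```

Because the comparison is per segment, an entry that spans several words would never match; that case would need extra handling.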

schungx · August 15, 2017, 6:55am

OMG I have been trying to find this setting for ages. Almost went ahead to create a dumb topic here just to ask where it is.

Turns out that it is tucked under “Logs”… probably the last place I’d expect to look for it!

Probably should file it under regular “Settings”.

schungx · August 15, 2017, 12:25pm

@sam is correct in that \b doesn’t seem to match at any non-ASCII (Unicode) word boundaries.

\w seems to be defined narrowly as [A-Za-z0-9_], probably just to parse source-code type texts. And \b is simply the position between a \w and a \W (or the start or end of the string). So using \b has the net effect of treating any character outside simple ASCII letters/digits as a non-word character, just like white-space. There doesn’t seem to be an easy way out to deal with this.
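A quick way to see this in a JavaScript console (the sample strings are made up for illustration):

```javascript
// \w and \b only know about ASCII letters, digits, and underscore.
const results = {
  asciiWord: /\bhello\b/i.test("hello world"),                 // true
  cyrillicWordChar: /\w/.test("Я"),                            // false
  cyrillicAlone: /\bЗдравствуйте\b/i.test("Здравствуйте мир"), // false: no \w on either side of the edges
  cyrillicGlued: /\bЗдравствуйте\b/i.test("xЗдравствуйтеx"),   // true: the ASCII "x" creates a boundary
};
console.log(results);
```

The last case shows the inversion: the Cyrillic word only "has boundaries" when it is glued to ASCII letters, which is the opposite of what a censor filter wants.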

One option is to omit the \b wrapping altogether. That seems like a good idea, because \b does not work for any language other than English, which is quite restrictive if you ask me. Not the entire world speaks English…

Then put a warning on the regexp filter setting that users must manually wrap their regexps in \b if they are dealing strictly with English.

This has multiple benefits:

For English – anyone who is capable of entering a regexp string should know how to put in a pair of \b's
For W. European languages – i.e. the extended Latin set, they can put \b around all the words that contain ASCII endings, and do more precise filtering on words with non-ASCII endings/beginnings.
For CJK languages – do nothing, simply search character-for-character. CJK languages are mostly not written with strict white-spacing between words, so there is no point in artificially searching for a word based on surrounding white-space, because that white-space won’t be there. White-space is not used to delimit words. In fact, for Chinese, Japanese and Chinese characters in Korean, words are not separated from one another; they stick together to form a single stream and there is nothing to break them apart other than context.
For other languages – e.g. Arabic etc., you are no worse off without the \b wrapping. In fact, with the \b wrapping, the user can do nothing. Without it, the user can still do some filtering.

elijah · August 31, 2017, 4:16am

In Ruby, I’d do something like:

censor_list.split('|').map!{|w| w.gsub(/^\w+$/, "\\b#{w}\\b") }.join('|')

So that word boundaries are automatically applied only to things that look like words.

irb(main):011:0> "aaa|Здравствуйте|bbb".split('|').map!{|w| w.gsub(/^\w+$/, "\\b#{w}\\b") }.join('|')
=> "\\baaa\\b|Здравствуйте|\\bbbb\\b"
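Since the censor code itself is JavaScript, roughly the same idea could look like this there. This is only a sketch; wrapAsciiWords is a hypothetical name, not an existing Discourse function.

```javascript
// Wrap \b only around entries made entirely of ASCII word characters;
// everything else (e.g. Cyrillic) is left as a plain substring match.
function wrapAsciiWords(censorList) {
  return censorList
    .split("|")
    .map((w) => (/^\w+$/.test(w) ? `\\b${w}\\b` : w))
    .join("|");
}

const pattern = wrapAsciiWords("aaa|Здравствуйте|bbb");
console.log(pattern); // \baaa\b|Здравствуйте|\bbbb\b

const censorRe = new RegExp(pattern, "gi");
console.log("aaa Здравствуйте bbbb".replace(censorRe, "■")); // ■ ■ bbbb
```

The trade-off is that non-ASCII entries match anywhere, including inside longer words, so this gives some filtering rather than exact whole-word filtering.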

Stranik · September 29, 2017, 8:50am

Perhaps it is not possible to rewrite this with regular expressions. :(
I tried an alternative, but it is a terrible option (it introduces loops). It works, but it is a very bad approach.

Summary

function escapeRegexp(text) {
  return text.replace(/[-/\\^$*+?.()|[\]{}]/g, '\\$&');
}

export function censorFn(censoredWords, censoredPattern, replacementLetter) {

  let patterns = [];

  replacementLetter = replacementLetter || "&#9632;";

  if (censoredWords && censoredWords.length) {
    patterns = censoredWords.split("|").map(t => `(${escapeRegexp(t)})`);
  }

  if (censoredPattern && censoredPattern.length > 0) {
    patterns.push("(" + censoredPattern + ")");
  }

  if (patterns.length) {
    return function(text) {
      let original = text;

      patterns.forEach(function(item) {
        // Strip the wrapping parentheses added when the patterns were built.
        item = item.replace(')', '').replace('(', '');
        original = maxsearch(original, item);
      });

      return original;
    };
  }

  return function(t){ return t;};
}

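export function maxsearch(original, item) {
  // Split on single spaces and replace one exact token match per call,
  // recursing until no match is left.
  const mas = original.split(' ');
  const ind = mas.indexOf(item);
  if (ind >= 0) {
    mas[ind] = '…';
    return maxsearch(mas.join(' '), item);
  }
  return mas.join(' ');
}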

export function censor(text, censoredWords, censoredPattern, replacementLetter) {
  return censorFn(censoredWords, censoredPattern, replacementLetter)(text);
}

schungx · January 11, 2018, 5:49am

This issue is now solved by:

To control your own censor patterns, turn on Settings > Posting > watched words regular expressions.

Beware, your Watched Words will now be raw regular expressions and you’ll need to put in your own word break \b where necessary.

jomaxro · January 12, 2018, 11:00pm

This topic was automatically closed after 40 hours. New replies are no longer allowed.
