Censored pattern

1. Where can I find the list of censored words (how do I delete them, etc.)?
2. Russian characters are not working.

(v1.9.0.beta4 +317)


This does not look like a regression, I think it never worked:

This regex does not work with Cyrillic.

Anyone have any ideas on how to rewrite:

censorRegexp = new RegExp("(\\b(?:(hello)|(Здравствуйте))\\b)(?![^\\(]*\\))", "ig");

so that it does not use \b, which does not support Russian characters?

(Be sure to reply here with a tested regex; I have already seen the posts on Stack Overflow.)


That’s a tough one. :grimacing:
I don’t think this can be solved with a regex.

http://unicode.org/reports/tr29/tr29-9.html#Word_Boundaries

OMG I have been trying to find this setting for ages. Almost went ahead to create a dumb topic here just to ask where it is.

Turns out that it is tucked under “Logs”… probably the last place I’d expect to look for it!

Probably should file it under regular “Settings”.

@sam is correct: \b doesn’t seem to match word breaks next to any non-ASCII (Unicode) characters.

\w seems to be defined narrowly as [A-Za-z0-9_], probably just to parse source-code-like text. And \b is simply the zero-width position between a \w and a \W (or the start/end of the string). So using \b has the net effect of treating any character outside simple ASCII letters, digits, and underscore as if it were white-space. There doesn’t seem to be an easy way around this.
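To make the behaviour described above concrete, here is a small check in plain JavaScript. The sample sentences are made up for illustration; only standard RegExp behaviour is used.

```javascript
// \b sits between a \w character and a \W character.
// Since \w is ASCII-only, Cyrillic letters count as \W.
const asciiMatches = /\bhello\b/i.test("say hello there");

// Here the space before "З" and "З" itself are both \W,
// so there is no \b at that position and the match fails.
const cyrillicMatches = /\bЗдравствуйте\b/i.test("скажи Здравствуйте всем");

// A Cyrillic letter on its own is not a "word character" for \w.
const cyrillicIsWordChar = /\w/.test("З");

console.log(asciiMatches, cyrillicMatches, cyrillicIsWordChar);
// prints: true false false
```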

One option is to omit the \b wrapping altogether. That is a good idea because \b does not work for any language other than English, which is quite restrictive if you ask me. Not the entire world speaks English…

Then put a warning on the regexp filter setting saying that users must manually wrap their regexps in \b if they are dealing strictly with English.

This has multiple benefits:

  1. For English – anyone who is capable of entering a regexp string should know how to put in a pair of \b's

  2. For W. European languages – i.e. the extended Latin set, they can put \b around all the words that contain ASCII endings, and do more precise filtering on words with non-ASCII endings/beginnings.

  3. For CJK languages – do nothing, simply search character-for-character. CJK languages are mostly not written with strict white-spacing between words, so there is no point in artificially searching for a word based on the white-space surrounding it, because that white-space won’t be there. White-space is not used to delimit words. In fact, for Chinese, Japanese and Chinese characters in Korean, words are not separated from one another; they stick together to form a single stream and there is nothing to break them apart other than context.

  4. For other languages – e.g. Arabic – you are no worse off without the \b wrapping. In fact, with the \b wrapping, the user can do nothing. Without it, the user can still do some filtering.


In Ruby, I’d do something like:

censor_list.split('|').map { |w| w =~ /\A\w+\z/ ? "\\b#{w}\\b" : w }.join('|')

So that word boundaries are automatically applied only to things that look like words. (Note the doubled backslash: in a double-quoted Ruby string, "\b" is a backspace character, not the regex word-boundary escape.)

irb(main):011:0> "aaa|Здравствуйте|bbb".split('|').map { |w| w =~ /\A\w+\z/ ? "\\b#{w}\\b" : w }.join('|')
=> "\\baaa\\b|Здравствуйте|\\bbbb\\b"
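Since the censor code itself runs in JavaScript, the same idea could be sketched there roughly like this. `wrapAsciiWords` is a hypothetical helper name, not anything that exists in Discourse, and the sketch assumes the list is a plain `|`-separated string as in the Ruby version.

```javascript
// Hypothetical helper: add \b only around entries made purely of
// ASCII word characters; leave everything else (e.g. Cyrillic) as-is.
function wrapAsciiWords(censorList) {
  return censorList
    .split("|")
    .map(w => (/^\w+$/.test(w) ? `\\b${w}\\b` : w))
    .join("|");
}

// The resulting regex source is: \baaa\b|Здравствуйте|\bbbb\b
const source = wrapAsciiWords("aaa|Здравствуйте|bbb");
const wrappedRegexp = new RegExp(source, "ig");

// "aaab" is left alone because of \b; the Cyrillic word matches
// anywhere, since it has no boundary check at all.
const wrappedResult = "aaab Здравствуйте aaa".replace(wrappedRegexp, "*");
console.log(wrappedResult); // prints: aaab * *
```

The trade-off is the one discussed above: non-ASCII entries lose whole-word matching entirely, so they will also match inside longer words.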

Perhaps it will not be possible to rewrite this with regular expressions. :(
I tried an approach that introduces loops instead. It works, but it is a terrible option.

Summary
// Loop-based workaround: no regular expressions, so no dependence on \b.
// Limitations: only whole, space-separated tokens are matched, so a censored
// word followed by punctuation ("hello,") is not caught, matching is
// case-sensitive, and censoredPattern is compared as a literal string.
export function censorFn(censoredWords, censoredPattern, replacementLetter) {
  let words = [];

  // Kept for signature compatibility; this workaround always substitutes "…".
  replacementLetter = replacementLetter || "■";

  if (censoredWords && censoredWords.length) {
    words = censoredWords.split("|");
  }

  if (censoredPattern && censoredPattern.length > 0) {
    words.push(censoredPattern);
  }

  if (words.length) {
    return function(text) {
      let result = text;
      words.forEach(function(word) {
        result = maxsearch(result, word);
      });
      return result;
    };
  }

  return function(t) { return t; };
}

// Replaces every space-separated token that equals `word` with "…".
export function maxsearch(original, word) {
  return original
    .split(' ')
    .map(token => (token === word ? '…' : token))
    .join(' ');
}

export function censor(text, censoredWords, censoredPattern, replacementLetter) {
  return censorFn(censoredWords, censoredPattern, replacementLetter)(text);
}

This issue is now solved by:

To control your own censor patterns, turn on Settings > Posting > watched words regular expressions.

Beware, your Watched Words will now be raw regular expressions and you’ll need to put in your own word break \b where necessary.
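For non-ASCII words, where \b is no help, a hand-written boundary might look something like the sketch below. It assumes a JavaScript engine with lookbehind and Unicode property escapes (ES2018) and that the pattern is compiled with the `u` flag; nothing in this topic confirms that Discourse compiles watched words that way, so treat it as an illustration only. `unicodeWordRegexp` is a hypothetical helper name.

```javascript
// Hypothetical helper: match `word` only when it is not directly
// preceded or followed by a letter, digit, or underscore in any script.
function unicodeWordRegexp(word) {
  // Escape regex syntax characters (a set that is valid under the `u` flag).
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(
    `(?<![\\p{L}\\p{N}_])${escaped}(?![\\p{L}\\p{N}_])`,
    "giu"
  );
}

const cyrillicRegexp = unicodeWordRegexp("Здравствуйте");

// The standalone word is censored; the copy embedded in a longer
// word is left alone because letters surround it.
const censoredText = "Здравствуйте, мир! НеЗдравствуйтеЖе".replace(
  cyrillicRegexp,
  m => "■".repeat(m.length)
);
console.log(censoredText);
```

This is only a rough stand-in for real word segmentation. The Unicode word-boundary rules linked earlier in the topic cover many cases a simple letter/digit check does not.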

