When using Watched Words, accented characters can cause false positives: the filter splits a word at the accented character instead of treating that character as part of the word. It seems that the word filter treats letters with accents and other diacritics as word separators, as if they were blank spaces.
Repro steps:
Add ‘anal’ to blocked Watched Words
As a non-admin user, attempt to use analógico in a post
Post is blocked
Attempting the same with analog works as intended, and the post is allowed.
nizar9
April 24, 2023, 7:43pm
3
I was able to reproduce the same thing on my end. This bug also affects other non-ASCII characters, including those with a cedilla like ç and ş.
nbianca
(Bianca)
May 18, 2023, 3:06pm
10
Support for UTF-8 characters in watched words has been implemented in this PR:
discourse:main ← discourse:fix_utf8
opened 07:17PM - 02 May 23 UTC
Watched words were converted to regular expressions containing \W, which handled… only ASCII characters. Using [^[:word:]] instead ensures that UTF-8 characters are also handled correctly.
This should correctly detect word boundaries for all words, including those that contain UTF-8 characters.
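For anyone who wants to see the difference locally, here is a minimal Ruby sketch of the two boundary checks. Only \W and [^[:word:]] come from the PR description; the helper names and the way the pattern is wrapped around the word are assumptions made for illustration, not the actual Discourse code.

```ruby
# Illustrative only: the helper names and the (?:...|^) / (?:...|$) wrapping
# are assumptions, not Discourse's actual implementation.

# Boundary check using \W. In Ruby, \W is the complement of the ASCII-only
# set [a-zA-Z0-9_], so a letter like "ó" counts as a non-word character.
def ascii_boundary_regex(word)
  /(?:\W|^)#{Regexp.escape(word)}(?:\W|$)/i
end

# Boundary check using the POSIX bracket class [[:word:]], which is
# Unicode-aware in Ruby and therefore treats "ó" as part of the word.
def unicode_boundary_regex(word)
  /(?:[^[:word:]]|^)#{Regexp.escape(word)}(?:[^[:word:]]|$)/i
end

watched = "anal"

["analógico", "analog", "anal"].each do |text|
  ascii   = ascii_boundary_regex(watched).match?(text)
  unicode = unicode_boundary_regex(watched).match?(text)
  puts "#{text}: \\W match=#{ascii}, [^[:word:]] match=#{unicode}"
end
```

With the \W version, analógico is flagged because ó is treated as a boundary, while analog is not. With the [^[:word:]] version, neither is flagged, and the watched word on its own still matches.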
nbianca
(Bianca)
Closed
May 22, 2023, 5:00am
11
This topic was automatically closed after 3 days. New replies are no longer allowed.