Watched Words Improvement -- similar looking unicode characters

markersocial · August 5, 2019, 3:00am

I think watched words could be improved if similar-looking Unicode characters also matched.

For example:
abcabcabc
𝘢𝘣𝘤𝘢𝘣𝘤𝘢𝘣𝘤
𝒂𝒃𝒄𝒂𝒃𝒄𝒂𝒃𝒄
ab𝘤𝘢𝘣𝒄𝒂𝒃𝒄

This essentially allows spammers to create many variations of the same word to circumvent the word filter. I’ve been getting hammered by crafty, motivated spammers, so they’ve really been pushing Discourse’s anti-spam features to the absolute limit. This is one of the techniques they’re using.

Perhaps this could be useful: https://github.com/janlelis/unicode-confusable
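
As a rough illustration of the idea (not Discourse's actual watched-words code), the styled letters in the examples above are Unicode "Mathematical Alphanumeric Symbols", which carry compatibility decompositions to plain letters. NFKC normalization therefore folds them back to ASCII before matching. In this minimal Python sketch, `fold` and `matches_watched_word` are hypothetical names chosen for the example:

```python
import unicodedata
from typing import List


def fold(text: str) -> str:
    """Apply NFKC normalization, then case folding.

    NFKC replaces compatibility characters (such as math-styled letters)
    with their plain equivalents.
    """
    return unicodedata.normalize("NFKC", text).casefold()


def matches_watched_word(text: str, watched: List[str]) -> bool:
    """Hypothetical helper: True if any watched word occurs in the folded text."""
    folded = fold(text)
    return any(fold(word) in folded for word in watched)


# The four example strings from the post above.
samples = ["abcabcabc", "𝘢𝘣𝘤𝘢𝘣𝘤𝘢𝘣𝘤", "𝒂𝒃𝒄𝒂𝒃𝒄𝒂𝒃𝒄", "ab𝘤𝘢𝘣𝒄𝒂𝒃𝒄"]
for sample in samples:
    print(fold(sample), matches_watched_word(sample, ["abc"]))
```

NFKC only covers characters that Unicode itself defines as compatibility equivalents. Cross-script lookalikes, such as a Cyrillic "а" standing in for a Latin "a", are not folded this way and would need a confusables table like the one the linked gem is built on.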

codinghorror · August 5, 2019, 3:07am

That’s not a “font”; that is a different set of Unicode characters.

markersocial · August 5, 2019, 3:17am

Ah my bad, thanks for the correction. Updated the post.

codinghorror · August 5, 2019, 4:00am

Unlikely, as that kind of unicode “looks like” matching is extremely expensive in CPU time and also very finicky to get right, because who decides what “looks like” something else?

I suggest you should consider other methods of dealing with these spammers.

In the meantime, just add common variations of spam terms as needed in different unicode characters.
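
For anyone taking the manual route, a small script can generate the styled variants of a term to paste into the watched words list. This is only a sketch under stated assumptions: `styled_variants` and `STYLED_LOWER_A` are hypothetical names, and it covers three Unicode math letter styles whose lowercase a to z are contiguous (bold, bold italic, sans-serif italic).

```python
# Code point of the styled lowercase "a" in each Unicode math letter style.
# Only styles with a contiguous a-z run are listed here.
STYLED_LOWER_A = {
    "bold": 0x1D41A,
    "bold_italic": 0x1D482,
    "sans_italic": 0x1D622,
}


def styled_variants(word):
    """Return one uniformly styled variant of `word` per style.

    Only ASCII lowercase letters are converted; every other character
    is passed through unchanged.
    """
    variants = []
    for start in STYLED_LOWER_A.values():
        variants.append("".join(
            chr(start + ord(ch) - ord("a")) if "a" <= ch <= "z" else ch
            for ch in word
        ))
    return variants


for variant in styled_variants("abc"):
    print(variant)
```

This only produces uniformly styled variants. Mixed strings such as `ab𝘤𝘢𝘣𝒄𝒂𝒃𝒄` multiply quickly with word length, so a list built this way will not catch every combination.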
