Bypassing watched words with confusable character replacements

I’m not 100% sure this a bug. But watched words can’t match weirdly encoded posts. Or I can’t write the regex to make it match them.

I have been hit with a bunch of spam lately. Watched words helped initially but now they are posting in a weird encoding. The spam looks like this:

I was filtering on “helpline” - which is pasted here:

𝐇𝐞𝐥𝐩𝐥𝐢𝐧𝐞

But if I look at what they are posting in ASCII, it looks like the table below. It is some kind of weird encoding. Mime maybe?

I’m working through the related thread on “Tips for Preventing Spam” and starting to lock things down more. But this type of posting is challenging.

This isn’t an encoding problem… the term you’re looking for is Confusable Unicode characters.

Even if it were just the letter C we would have:

ℂ 𝕮 C 𝙲 𝑪 𝒞 𝖢 𐌂 𝗖 ꓚ Ꮯ 𐔜 С C Ⲥ 𝓒 Ⅽ ℭ 𝘊 𐐕 𝘾 🝌 𝐂 𑣲 𑣩 Ϲ 𐊢 𝐶

This is probably best presented as a feature request, something along the line of an option to “Replace confusable unicode characters before applying Watched Words”.

FYI, the set of confusable characters is public data published by the consortium, e.g.:

#	C	Ⅽ	C	ௐ	С	Ꮯ	Ⲥ	ꓚ	Ϲ	ℂ	ℭ	𝐂	𝐶	𝑪	𝒞	𝓒	𝕮	𝖢	𝗖	𝘊	𝘾	𝙲
	(‎ C ‎)	0043	 LATIN CAPITAL LETTER C
←	(‎ Ⅽ ‎)	216D	 ROMAN NUMERAL ONE HUNDRED
←	(‎ C ‎)	FF23	 FULLWIDTH LATIN CAPITAL LETTER C	# →С→
←	(‎ ௐ ‎)	0BD0	 TAMIL OM	# →С→
←	(‎ С ‎)	0421	 CYRILLIC CAPITAL LETTER ES
←	(‎ Ꮯ ‎)	13DF	 CHEROKEE LETTER TLI
←	(‎ Ⲥ ‎)	2CA4	 COPTIC CAPITAL LETTER SIMA	# →Ϲ→
←	(‎ ꓚ ‎)	A4DA	 LISU LETTER CA
←	(‎ Ϲ ‎)	03F9	 GREEK CAPITAL LUNATE SIGMA SYMBOL
←	(‎ ℂ ‎)	2102	 DOUBLE-STRUCK CAPITAL C
←	(‎ ℭ ‎)	212D	 BLACK-LETTER CAPITAL C
←	(‎ 𝐂 ‎)	1D402	 MATHEMATICAL BOLD CAPITAL C
←	(‎ 𝐶 ‎)	1D436	 MATHEMATICAL ITALIC CAPITAL C
←	(‎ 𝑪 ‎)	1D46A	 MATHEMATICAL BOLD ITALIC CAPITAL C
←	(‎ 𝒞 ‎)	1D49E	 MATHEMATICAL SCRIPT CAPITAL C
←	(‎ 𝓒 ‎)	1D4D2	 MATHEMATICAL BOLD SCRIPT CAPITAL C
←	(‎ 𝕮 ‎)	1D56E	 MATHEMATICAL BOLD FRAKTUR CAPITAL C
←	(‎ 𝖢 ‎)	1D5A2	 MATHEMATICAL SANS-SERIF CAPITAL C
←	(‎ 𝗖 ‎)	1D5D6	 MATHEMATICAL SANS-SERIF BOLD CAPITAL C
←	(‎ 𝘊 ‎)	1D60A	 MATHEMATICAL SANS-SERIF ITALIC CAPITAL C
←	(‎ 𝘾 ‎)	1D63E	 MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL C
←	(‎ 𝙲 ‎)	1D672	 MATHEMATICAL MONOSPACE CAPITAL C

(hah, I just discovered that chromium implements this simplification, when attempting to search for 𝑪 it found most confusable instances of the letter C - try it here!)

An example algorithm using this data is implemented in python here.

In your last screenshot @billgraziano the first ampersand-encoded numeric character reference you post ( 𝐇 ) would be:

←	(‎ 𝐇 ‎)	1D407	 MATHEMATICAL BOLD CAPITAL H

(the ordinal for 𝐇 is 119815 encoded in decimal or 1D407 encoded in hex)

So a complete case-insensitive regex for “helpline” you could use would be:

[HHℋℌℍ𝐇𝐻𝑯𝓗𝕳𝖧𝗛𝘏𝙃𝙷Η𝚮𝛨𝜢𝝜𝞖ⲎНᎻᕼꓧ𐋏ⱧҢĦӉӇhhℎ𝐡𝒉𝒽𝓱𝔥𝕙𝖍𝗁𝗵𝘩𝙝𝚑һհᏂɦꚕᏲħℏћ][E⋿Eℰ𝐄𝐸𝑬𝓔𝔈𝔼𝕰𝖤𝗘𝘌𝙀𝙴Ε𝚬𝛦𝜠𝝚𝞔ЕⴹᎬꓰ𑢦𑢮𐊆Ɇe℮eℯⅇ𝐞𝑒𝒆𝓮𝔢𝕖𝖊𝖾𝗲𝘦𝙚𝚎ꬲеҽɇҿ][L𝈪Ⅼℒ𝐋𝐿𝑳𝓛𝔏𝕃𝕷𝖫𝗟𝘓𝙇𝙻ⳐᏞᒪꓡ𖼖𑢣𑢲𐐛𐔦ŁLjLJl‎\|∣⏽│1‎۱𐌠‎𝟏𝟙𝟣𝟭𝟷IIⅠℐℑ𝐈𝐼𝑰𝓘𝕀𝕴𝖨𝗜𝘐𝙄𝙸Ɩlⅼℓ𝐥𝑙𝒍𝓁𝓵𝔩𝕝𝖑𝗅𝗹𝘭𝙡𝚕ǀΙ𝚰𝛪𝜤𝝞𝞘ⲒІӀ‎‎‎ⵏᛁꓲ𖼨𐊊𐌉‎‎łɭƗƚɫ‎‎‎ŀĿᒷ🄂⒈‎⒓㏫㋋㍤⒔㏬㍥⒕㏭㍦⒖㏮㍧⒗㏯㍨⒘㏰㍩⒙㏱㍪⒚㏲㍫ljIJ‖∥Ⅱǁ‎𐆙⒒Ⅲ𐆘㏪㋊㍣Ю⒑㏩㋉㍢ʪ₶ⅣⅨɮʫ㏠㋀㍙][PPℙ𝐏𝑃𝑷𝒫𝓟𝔓𝕻𝖯𝗣𝘗𝙋𝙿Ρ𝚸𝛲𝜬𝝦𝞠ⲢРᏢᑭꓑ𐊕ᒆp⍴p𝐩𝑝𝒑𝓅𝓹𝔭𝕡𝖕𝗉𝗽𝘱𝙥𝚙ρϱ𝛒𝛠𝜌𝜚𝝆𝝔𝞀𝞎𝞺𝟈ⲣрƥᵽᑷ][L𝈪Ⅼℒ𝐋𝐿𝑳𝓛𝔏𝕃𝕷𝖫𝗟𝘓𝙇𝙻ⳐᏞᒪꓡ𖼖𑢣𑢲𐐛𐔦ŁLjLJl‎\|∣⏽│1‎۱𐌠‎𝟏𝟙𝟣𝟭𝟷IIⅠℐℑ𝐈𝐼𝑰𝓘𝕀𝕴𝖨𝗜𝘐𝙄𝙸Ɩlⅼℓ𝐥𝑙𝒍𝓁𝓵𝔩𝕝𝖑𝗅𝗹𝘭𝙡𝚕ǀΙ𝚰𝛪𝜤𝝞𝞘ⲒІӀ‎‎‎ⵏᛁꓲ𖼨𐊊𐌉‎‎łɭƗƚɫ‎‎‎ŀĿᒷ🄂⒈‎⒓㏫㋋㍤⒔㏬㍥⒕㏭㍦⒖㏮㍧⒗㏯㍨⒘㏰㍩⒙㏱㍪⒚㏲㍫ljIJ‖∥Ⅱǁ‎𐆙⒒Ⅲ𐆘㏪㋊㍣Ю⒑㏩㋉㍢ʪ₶ⅣⅨɮʫ㏠㋀㍙][Ii˛⍳iⅰℹⅈ𝐢𝑖𝒊𝒾𝓲𝔦𝕚𝖎𝗂𝗶𝘪𝙞𝚒ı𝚤ɪɩιιͺ𝛊𝜄𝜾𝝸𝞲іꙇӏꭵᎥ𑣃⍸ɨᵻᵼⅱⅲijⅳⅸ][NNℕ𝐍𝑁𝑵𝒩𝓝𝔑𝕹𝖭𝗡𝘕𝙉𝙽Ν𝚴𝛮𝜨𝝢𝞜Ⲛꓠ𐔓𐆎ƝNjNJ№n𝐧𝑛𝒏𝓃𝓷𝔫𝕟𝖓𝗇𝗻𝘯𝙣𝚗ոռɳƞη𝛈𝜂𝜼𝝶𝞰ᵰnj][E⋿Eℰ𝐄𝐸𝑬𝓔𝔈𝔼𝕰𝖤𝗘𝘌𝙀𝙴Ε𝚬𝛦𝜠𝝚𝞔ЕⴹᎬꓰ𑢦𑢮𐊆Ɇe℮eℯⅇ𝐞𝑒𝒆𝓮𝔢𝕖𝖊𝖾𝗲𝘦𝙚𝚎ꬲеҽɇҿ]
5 Likes

@supermathie you are incredibly helpful! That is it.

I enabled regex on Watched Words, hacked something up, and we will see how that goes. I’m also working on some of the other tips in spam blocking.

After years of almost no spam, someone seems to have found my nice clean forums :frowning:

2 Likes