`levenshtein distance spammer emails` should flag accounts that are similar even if no accounts have been marked as spammers yet

Update: After detailed discussion, it seems this setting is working as currently coded: it only checks new emails against a list that is populated after an account is deleted. As this does not help when a spammer creates many accounts immediately before spamming the site, it should be updated to check emails against all recent registrations and put users into the approval queue if the similarity is too high.


We're working to update our spam settings after a recent attack at Stonehearth. While cleaning up the damage, we found that someone had made multiple accounts, all with suspiciously similar email addresses. Looking closer, we realized that all the emails are identical, at least as far as Gmail is concerned, but it seems Discourse treated them as different emails. From reading about Levenshtein distance on Google, it seems to be a measurement of string similarity. With a default value of 2, I would have assumed these accounts would have been caught, as they only moved the period.

Thoughts?

Emails in question:

q.kzkfkzkwlsh1@gmail.com
qk.zkfkzkwlsh1@gmail.com
qkz.kfkzkwlsh1@gmail.com
qkzk.fkzkwlsh1@gmail.com
qkzkf.kzkwlsh1@gmail.com
qkzkfk.zkwlsh1@gmail.com
qkzkfkz.kwlsh1@gmail.com
qkzkfkzk.wlsh1@gmail.com
qkzkfkzkw.lsh1@gmail.com
qkzkfkzkwl.sh1@gmail.com
qkzkfkzkwls.h1@gmail.com
qkzkfkzkwlsh1@gmail.com

It’s a good point, can you make sure we have tests to catch this case @techAPJ? It does seem like it should have caught these.


Thanks for getting this looked into @codinghorror, still curious what the setting does (unless this is a bug and should have been caught)?

Further, this brings up the issue of sub-addressing (plus addresses). Will/can those be caught as well?

This is the best read for understanding it

In short, it tells you the minimum number of changes needed to turn one word into the other. My guess is that the period is forcing the number to be higher than your setting. I’d have to run the strings through the actual Ruby method to confirm that, though.


Why would a period be any different than another “single character edit”? Shouldn’t it be a simple insertion and thus have a value of one? (Also, I haven’t seen the Ruby method…)

Well, an insert is different from an inline edit, per the kitten-to-sitting example in the wiki definition. My thought is that the insertion causes all characters after it to be considered different. It’s all theory right now, though; I agree it should count as 1 and not 15.

By Ruby method, I mean Ruby’s implementation of it. Just like C# has its own, PHP has one, and so forth.

If I get a chance I’ll try it out tomorrow.

Yes, and no. Yes it is different from edit. No, it doesn’t change the value used in the algorithm. Above in the article it talks about 3 types of edits. Specifically:

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

From that, and from a nice page from the University of Pittsburgh, insertions, deletions, and substitutions each count as 1 toward the sum.
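For anyone who wants to check this by hand, here is a minimal sketch of the textbook dynamic-programming version in plain Ruby. It is not the implementation Discourse uses; the `levenshtein` method below is just a local helper written for illustration. It does show that each edit type costs exactly 1.

```ruby
# Textbook dynamic-programming Levenshtein distance (illustrative helper,
# not Discourse's implementation).
# Insertions, deletions and substitutions each cost exactly 1.
def levenshtein(a, b)
  # prev[j] = distance between the processed prefix of a and the first j chars of b
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i] # turning i chars of a into an empty string takes i deletions
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [
        prev[j] + 1,        # deletion
        curr[j - 1] + 1,    # insertion
        prev[j - 1] + cost  # substitution (free when the chars match)
      ].min
    end
    prev = curr
  end
  prev.last
end

puts levenshtein("kitten", "sitting")                                     # 3
puts levenshtein("qkzkfkzkwlsh1@gmail.com", "q.kzkfkzkwlsh1@gmail.com")   # 1
puts levenshtein("q.kzkfkzkwlsh1@gmail.com", "qk.zkfkzkwlsh1@gmail.com")  # 2
```

Adding a period is one insertion, and moving a period is at most one deletion plus one insertion, so the addresses in the first post are never more than 2 apart.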

Right. That is how it should work. That doesn’t mean the Ruby implementation is doing that, though (I suspect it is), or that something else isn’t interfering with it being executed, such as one of the accounts being TL1 or higher (more likely).


All TL0 when deleted, all created today, 9/7/16 around 11:40 UTC (plus or minus 10 minutes). Look forward to your exploration of the Ruby code!

ScreenedEmail.levenshtein('q.kzkfkzkwlsh1@gmail.com', 'qk.zkfkzkwlsh1@gmail.com')
=> 2
ScreenedEmail.levenshtein('q.kzkfkzkwlsh1@gmail.com', 'qkzkfkzkwls.h1@gmail.com')
=> 2
ScreenedEmail.levenshtein('q.kzkfkzkwlsh1@gmail.com', 'qkzkfkzkwlsh1@gmail.com')
=> 1

There doesn’t appear to be any TL-related limitations on skipping the blocking, however it only checks the most recent 100 screened emails; if some scuzzbucket is creating lots of accounts with different e-mails, and occasionally coming back to using another e-mail address in the pattern, you might end up with them not being caught.


Nope, they should definitely have been within 100 checks of each other. Do you know if the site setting is < or <=?

Definitely <=:

max_distance = SiteSetting.levenshtein_distance_spammer_emails
screened_emails.select { |se| distances[se.email] <= max_distance }

So as long as your setting was at or above 2, these should have been caught.


According to an Admin on the site, it was (and is) set at 2. You guys host us, feel free to take a look yourself, dig into the logs, do whatever you need. We’ve dealt with the initial issue (overwhelming spam) hours ago, just trying to prevent it from happening again.

Wow, you really did get hammered… the staff actions log on your site is quite busy. My condolences. :grin:

From the look of it, the spammer created all their accounts before you started deleting them, is that correct? If so, there’s the issue: we only put e-mails on the screening list after their associated account gets whacked. It stops miscreants from coming back after they’ve been banned with near-identical e-mail addresses, but doesn’t stop someone from registering a bunch of accounts with similar addresses and then going to town.
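To make the ordering problem concrete, here is a toy model of that behaviour. Everything in it (`ToyScreener`, `blocked?`, `delete_account`) is a hypothetical stand-in written for this post, not Discourse's real classes; it only mimics "screen against the last 100 deleted addresses with `<=` on the distance".

```ruby
# Illustrative Levenshtein helper (each edit costs 1).
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca == cb ? 0 : 1)].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical toy model: the screening list is only filled on deletion.
class ToyScreener
  MAX_DISTANCE = 2 # same value as the site setting discussed in this topic

  def initialize
    @screened = []
  end

  # A signup is blocked only if it is close to an already screened address.
  def blocked?(email)
    @screened.last(100).any? { |s| levenshtein(s, email) <= MAX_DISTANCE }
  end

  # Deleting an account is the only thing that adds to the list.
  def delete_account(email)
    @screened << email
  end
end

screener = ToyScreener.new
burst = %w[q.kzkfkzkwlsh1@gmail.com qk.zkfkzkwlsh1@gmail.com qkz.kfkzkwlsh1@gmail.com]

# All three register before any deletion, so the list is still empty.
p burst.map { |e| screener.blocked?(e) } # [false, false, false]

screener.delete_account(burst.first)

# Only after the first deletion does a similar address get caught.
p screener.blocked?("qkzk.fkzkwlsh1@gmail.com") # true
```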


Thanks. This happened early in the morning US East Coast Time, so our moderator from France took the brunt of it.

Yes. The moderator handling this simply blocked the accounts as they were coming in. Only after things calmed down did I get online and start deleting, and then she noticed the 12 or so accounts that were similar.

I guess my question here is why? To me it shouldn’t matter if someone was deleted or not - ridiculously similar emails (like levenshtein <=2) should be flagged in some way immediately. Otherwise, you get, well, this.


Flagged for review because of an e-mail address very similar to another user, I could see as being reasonable. The current implementation just auto-blocks the new user, which some might consider overly draconian for having an e-mail address that happens to be similar to any other user on the site. It’d be more palatable if the lookup was done on, say, the last hundred signed-up users, rather than everyone. Tricky to get it dialled in just right, though.

@neil, you did some of the early work on this feature, what are your thoughts?


Absolutely agree here. Just like my other request here I’d like to see this flagged for review, not auto-blocked silently. Also agree that there needs to be a time-period, not forever. 100 days certainly sounds reasonable, and while I don’t have the data to back this up, I’d assume that shortening this substantially (to 7 days even) would likely be effective against most spammers. I’d like to hope most spammers don’t plan ahead and spread out their registrations to avoid detection.
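Roughly what I have in mind, as a sketch only: compare each new signup against recent registrations inside a time window and send matches to review instead of blocking. The `Signup` struct, the `similar_recent_signups` method, its parameters, and the timestamps are all made up for illustration; none of this is existing Discourse code.

```ruby
# Illustrative Levenshtein helper (each edit costs 1).
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca == cb ? 0 : 1)].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical record of a registration.
Signup = Struct.new(:email, :created_at)

# Returns the recent signups whose address is within max_distance edits of
# `email`. Only the newest `limit` signups inside the window are compared.
def similar_recent_signups(email, signups, now:, window_days: 7, limit: 100, max_distance: 2)
  cutoff = now - window_days * 24 * 60 * 60
  signups
    .select { |s| s.created_at >= cutoff }
    .sort_by(&:created_at)
    .last(limit)
    .select { |s| levenshtein(s.email, email) <= max_distance }
end

# Made-up timestamps for the example.
now = Time.utc(2016, 9, 7, 11, 50)
signups = [
  Signup.new("q.kzkfkzkwlsh1@gmail.com", Time.utc(2016, 9, 7, 11, 35)),
  Signup.new("someone.else@example.com", Time.utc(2016, 9, 7, 11, 36)),
  Signup.new("qk.zkfkzkwlsh1@gmail.com", Time.utc(2016, 8, 1, 9, 0)) # outside the window
]

matches = similar_recent_signups("qkz.kfkzkwlsh1@gmail.com", signups, now: now)

# Flag for review rather than silently blocking.
action = matches.empty? ? :allow : :needs_review
p action # :needs_review
```

The window and the limit are the knobs to tune: a shorter window means fewer false positives on innocent lookalike addresses, at the cost of missing a spammer who spreads registrations out.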


I’m thinking an algo that looks for “.” and “+” and any other common alias email account patterns would help.

e.g. BillO@gmail.com, BillW@gmail.com and Bill23@gmail.com could very well be totally innocent.

BillW.abc@gmail.com, BillW.def@gmail.com, BillW.ghi@gmail.com, are more likely to be what I call “seed” accounts.
* rate limiting would be a big plus in stopping this from happening
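
Something like the following is what I mean, as a rough sketch. `canonical_email` and `GMAIL_DOMAINS` are names I made up, and the rules are an assumption that only holds for Gmail-style addresses (dots in the local part ignored, anything after a "+" ignored); other providers would need their own rules or none at all.

```ruby
# Domains assumed to follow Gmail's aliasing rules (assumption for this sketch).
GMAIL_DOMAINS = %w[gmail.com googlemail.com].freeze

# Collapses Gmail-style aliases to one canonical address.
# Addresses on other domains are only trimmed and lowercased.
def canonical_email(email)
  local, domain = email.strip.downcase.split("@", 2)
  return email if domain.nil? # not an address we can parse; leave untouched

  if GMAIL_DOMAINS.include?(domain)
    local = local.sub(/\+.*\z/, "") # drop "+tag" sub-addressing
    local = local.delete(".")       # dots in the local part are ignored
  end
  "#{local}@#{domain}"
end

aliases = %w[
  q.kzkfkzkwlsh1@gmail.com
  qkzkfk.zkwlsh1@gmail.com
  qkzkfkzkwlsh1+forum@gmail.com
  qkzkfkzkwlsh1@gmail.com
]

p aliases.map { |e| canonical_email(e) }.uniq
# ["qkzkfkzkwlsh1@gmail.com"]
```

Comparing canonical forms would treat all of these as the same mailbox regardless of the distance setting.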

No, this is really covered by the existing algorithm. Try it yourself in this online form

http://planetcalc.com/1721/

I’m confused. The existing algorithm didn’t catch this… :confused:.