`levenshtein distance spammer emails` should flag accounts that are similar even if no accounts have been marked as spammers yet

AFAIK the algo stops new registrations.
It doesn’t stop bulk registration of seed accounts that haven’t been added to the filter yet.

I was referring to the specific examples given directly above. As to why it did not work in your case, I am not sure. You could easily test by signing up for a bunch of new accounts using email plus addressing and a single character.


Not too worried about testing at the moment, will wait for Arpit to look into this as you requested above.

I think the problem wasn’t a failure in the levenshtein distance calculation, but the fact that all the new accounts were registered before any of them were blocked. That’s a different issue.


Alrighty, let me turn this post into a feature request as well, give me a few minutes to update the OP.

Edit: and done.


Agreed that this would be a new feature request. I don’t think I worked on the levenshtein, but that code definitely only looks at emails from deleted users who were added to the screened list(s).


Right now it’s handy to be able to use username+somestring@gmail.com to create a second account with the same username. I recommended it to someone here. If the intent is to keep people from creating multiple accounts, though, the code should strip dots from (at least) gmail addresses as well as + up to the @.
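To make that concrete, here is a minimal sketch of the kind of normalization I mean. It is illustrative Python, not anything from the Discourse codebase: the `canonical_email` helper and the `DOT_INSENSITIVE_DOMAINS` set are hypothetical names, and stripping the `+tag` for every domain (not just gmail) is an assumption on my part.

```python
# Hypothetical config: domains where dots in the local part are ignored.
DOT_INSENSITIVE_DOMAINS = {"gmail.com"}


def canonical_email(email: str) -> str:
    """Collapse plus-addressed and dotted variants onto one mailbox key."""
    local, sep, domain = email.strip().lower().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {email!r}")
    # Drop everything from the first "+" up to the "@".
    local = local.split("+", 1)[0]
    # Dots only get stripped where the provider is known to ignore them.
    if domain in DOT_INSENSITIVE_DOMAINS:
        local = local.replace(".", "")
    return f"{local}@{domain}"
```

Two signups whose canonical forms match would then count as the same mailbox, whatever was typed in the registration form.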


Irrelevant, unless you are creating emails that are literally different by only 2 characters – such as name+a@example.com

I don’t see how it’s irrelevant. Right now you can create an infinite number of accounts with a single gmail address like this

and so on are all the same mailbox. If the intent is to make it difficult for one person to create multiple accounts by making them get more mailboxes, the current code doesn’t do that.


Oh, I misunderstood your point. So to clarify @jomaxro here’s how it works.

  1. At the time of new account creation
  2. Check the last 100 screened (blocked) email addresses in the block list
  3. Is the new user’s email address within a Levenshtein distance of 2 of any of those 100 screened addresses?
  4. If it is, block signup by that new user
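The steps above can be sketched roughly as follows. This is illustrative Python, not the actual Discourse implementation: the function names, the lowercase comparison, and the assumption that `screened_emails` is ordered oldest to newest are all mine.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def should_block_signup(new_email, screened_emails, max_distance=2, window=100):
    """Return True if new_email is close to one of the most recent screened emails.

    Assumes screened_emails is ordered oldest to newest, so the last
    `window` entries are the most recently screened ones.
    """
    recent = screened_emails[-window:]
    candidate = new_email.lower()
    return any(levenshtein(candidate, s.lower()) <= max_distance for s in recent)
```

Note what happens with an empty screened list: `should_block_signup("name+b@example.com", [])` is `False`, which is the situation described in this topic. Nothing had been screened yet, so there was nothing to compare against.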

I think the reason you saw this as “not working” is because none of these addresses were blocked at the time the users signed up.

You guys were cleaning up after the fact only.

I think it would be an excellent feature to check email addresses at registration, regardless of whether a similar address is already in the Screened Email list.

Bulk registration of these types of aliased email addresses is almost always a sure indication of mischievous intent.

In fact the only legitimate use that I can think of is if / when an Admin / Dev wanted to create multiple testing accounts without thinking up more unique names.


@codinghorror & @pfaffman, thanks for the discussion. I have 2 points I would like to make here.

  1. Jeff, you are correct that this is what was happening. This happened early morning EST, so I and all the North America-based moderators were offline. That left only one mod (from France) to deal with the spam. We were hit quite hard - over 20 accounts all seemingly at the same time, so her primary concern was to deal with what could be seen publicly, then deal with the mess of blocked accounts later. She wasn’t really thinking “let’s delete the account so Discourse will block the IP”, she was just trying to stop the barrage of spam from these accounts.

  2. Which brings me to point #2, she shouldn’t have to worry about what she is doing with the spammer’s accounts - blocked, suspended, deleted, etc. While I haven’t looked this closely at the logs yet, these accounts were all seemingly created within an hour or so of each other. This current feature assumes that a spammer attempts to create a new account with a similar email after being deleted from the site. It doesn’t deal with lots of spam all at once before a moderator has a chance to do cleanup.
    You talk about enabling community moderation. Many of these posts were flagged. Despite this, with the current feature implementation that wouldn’t have mattered, even if the posts got 3 flags and were hidden for moderator intervention. We should never have gotten into the mess we were in; the software should be smart enough to say “hey! This seems odd that someone is creating 12 accounts seconds after each other with nearly identical emails.” I cannot think of any normal reason why multiple accounts with nearly identical emails would need to be registered in immediate succession. The only example is the one @Mittineague pointed out: an Admin/Dev might do that, but the added effort of needing to approve their own accounts is far outweighed by the spam protection this affords - and an at-registration email check could bypass the registration queue if the IP matches that of a staff account.
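To illustrate the kind of check I’m asking for, here is a rough sketch in Python. It is a proposal, not existing Discourse behavior: the `RegistrationBurstDetector` class, its defaults (distance 2, one-hour window, hold on the third similar signup), and the assumption that timestamps arrive in non-decreasing order as seconds are all hypothetical.

```python
from collections import deque


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


class RegistrationBurstDetector:
    """Flags signups whose email resembles other recent signups.

    Unlike the screened-email check, this compares new registrations
    against each other, so it needs no prior moderator action.
    """

    def __init__(self, max_distance=2, window_seconds=3600, threshold=3):
        self.max_distance = max_distance
        self.window_seconds = window_seconds
        self.threshold = threshold   # similar signups (incl. this one) to trigger
        self._recent = deque()       # (timestamp, lowercase email), oldest first

    def register(self, email: str, timestamp: float) -> bool:
        """Record a signup; return True if it should go to an approval queue."""
        email = email.lower()
        # Forget signups that have aged out of the time window.
        while self._recent and timestamp - self._recent[0][0] > self.window_seconds:
            self._recent.popleft()
        similar = sum(
            1 for _, other in self._recent
            if levenshtein(email, other) <= self.max_distance
        )
        self._recent.append((timestamp, email))
        return similar + 1 >= self.threshold
```

With these defaults, `name+a@example.com` and `name+b@example.com` registered seconds apart would pass, and a third `name+c@example.com` in the same hour would be held for review, with no moderator needing to be online.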

I turned this from a support ticket to a feature request because I now understand that this feature did function as it was intended to. I am suggesting that it needs to be expanded to prevent what we saw at Stonehearth the other day.


What you really need to be asking is, why didn’t Akismet catch these, because all new user posts go through Akismet.

Note that if you do not delete the user as spam, their posts are not fed to Akismet as spam for Bayesian training, so that is rather critical.

Honestly, I don’t care what catches the spam. Yes, Akismet should have caught the posts as spam. Super similar emails should be put into an approval queue. Unexpected increases in user registrations in a short period of time should throw up a red flag. Regardless, this stuff wasn’t caught, and that’s what concerned us.

All posts that were flagged as spam were “agreed with” and subsequently deleted. Again, however, this relies on moderator action to work (delete user as spam) - what I’m asking for are more tools to stop spam when a moderator isn’t online.

That is exactly what Akismet is for, and should do. Every new user post is passed to Akismet for a spam test.

Correct me if I’m wrong, Akismet only checks new users’ posts, right? This won’t catch someone who makes multiple accounts with similar emails, gets past TL0, and then starts posting spam, right?

I’m all for Akismet catching spam posts, but I’m confused about what that has to do with a completely different method of preventing spam.

Did these users get to TL1?

Akismet checks TL0 posts immediately and TL1 posts later. It does not check TL2 or higher posts at all.

No they didn’t. They were all TL0. My comment was simply that it is possible to get around Akismet by hanging out long enough.

Only if you get to TL2, so I am not sure what you are talking about.

I understand that - not questioning what users Akismet scans.

I was trying to point out that Akismet catching spam and checking for similar emails on registration are two different methods of trying to prevent spam. A spammer could hang out for 15 days (and reach TL2) before starting to spam (however unlikely that is). Akismet isn’t perfect (we saw that full well at Stonehearth even as we confirmed posts were spam), so I’m trying to suggest other methods to make life harder for spammers.