Blocked Canonical Gmails - Issue

It seems that some spammers that have their gmail blocked are getting through with period variations despite having the canonical version of their email blocked.

E.g.
examplemailaddress@gmail.com is blocked
e.x.a.m.pl.e.m.ai.lad.dre.ss@gmail.com is getting through

It seems that the blocking is still working, just not entirely, as I’m still seeing regular matches against the record in logs → screened emails, but not for all combinations. The user was able to make a few hundred accounts today using the same blocked gmail.

The gmail dot variations they are using seem to contain between 6 and 14 periods, the email length is 19 (before @), and they aren’t using + variations (or all of those variations are being blocked successfully).

Might be relevant, I have levenshtein distance spammer emails set to 3 (default is 2). Discourse was recently updated from 2.6.x to 2.7.1 stable.
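A rough sketch of why the Levenshtein setting alone probably would not catch these. It assumes the setting compares addresses by plain edit distance, which is an assumption rather than something confirmed here; `levenshtein` below is a hand-written helper for illustration, not a Discourse method.

```ruby
# Hypothetical helper: classic dynamic-programming edit distance.
def levenshtein(a, b)
  # prev[j] holds the distance between the processed prefix of a and b[0...j].
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

blocked = "examplemailaddress"
variant = "e.x.a.m.pl.e.m.ai.lad.dre.ss"

# Each inserted period costs one edit, so the distance is the number of periods.
puts levenshtein(blocked, variant) # => 10
```

With 6 to 14 periods per address, every variant would sit well beyond a threshold of 3, so under that assumption only canonical matching could be expected to catch them.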

2 Likes

Hmm, I forget where we landed on this one @sam, but that would possibly be a bug, since you said

This means that if evil.person+77@gmail.com gets blocked we will go ahead and block evilperson@gmail.com instead. Then when e.v.i.l.person@gmail.com tries to sneak in they will be blocked due to canonical matching.
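To make the quoted behaviour concrete, here is a minimal sketch of that kind of canonicalization. `canonical_email` is a hypothetical helper written for this example, not the actual Discourse code, and it assumes that both periods and anything after a “+” in the local part can be ignored.

```ruby
# Hypothetical helper, not the Discourse implementation.
# Lowercases the address, drops any "+suffix", and removes periods
# from the part before the @.
def canonical_email(email)
  local, domain = email.strip.downcase.split("@", 2)
  return email.strip.downcase if domain.nil? # not an email-shaped string

  local = local.split("+", 2).first.to_s # drop the plus suffix, if any
  local = local.delete(".")              # periods are ignored
  "#{local}@#{domain}"
end

puts canonical_email("evil.person+77@gmail.com") # => evilperson@gmail.com
puts canonical_email("e.v.i.l.person@gmail.com") # => evilperson@gmail.com
```

Both addresses reduce to the same string, which is what lets a single blocked record cover every variation.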

3 Likes

So what happens when sara.hanson@ does something awful and sarah.anson@ gets caught in the crossfire? This is just like how I’m not sure joe98@ and joe99@ could be considered the same email address either. I suppose this depends upon the membership of the community and the level of manual discretion used in the matching process.

“Plus addressing” at least refers to a folder belonging to the mailbox of the same email address (given that everything before the “+” is the same).

Perhaps combat registration by IP range? All of this depends upon how sophisticated the spammers are. Coming here from the Let’s Encrypt community, we have a tracking thread over there detailing some pretty broad spamming tactics that have been attempted. We’ve even had people provide actual technical help before spamming weeks later.

1 Like

Not possible; those are the same account from gmail’s perspective, so they’d be the same person.

6 Likes

Interesting. I never realized that gmail actually made that distinction. Learned more than a few new things today. I wonder why they would do that? :thinking: Seems like it would eat up a fair amount of real estate. Are gmail addresses the only concern here?

2 Likes

I think we landed on “I am uncomfortable with the place we ended up, because it is a support nightmare and is never going away :)”.

I feel like if a site is a spam vector, they should be allowed to say “make all my emails canonical, I don’t care about the downsides”.

Meaning these 2 emails both have the canonical of samsam@somewhere.com

sam.sam@somewhere.com
samsam+11@somewhere.com

If sam.sam@somewhere.com registered, samsam+11@somewhere.com can no longer register.
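As a rough sketch of that “forced canonical” behaviour (an illustration only, not the reverted fix itself), the registry below refuses a signup when the canonical form is already taken. `CanonicalRegistry` and its methods are hypothetical names, and the canonical rule (strip “+suffix”, strip periods, any domain) is an assumption.

```ruby
require "set"

# Hypothetical in-memory registry; a real site would check the users table.
class CanonicalRegistry
  def initialize
    @seen = Set.new
  end

  # Assumed rule: lowercase, drop "+suffix", drop periods, for every domain.
  def canonical(email)
    local, domain = email.strip.downcase.split("@", 2)
    "#{local.to_s.split('+', 2).first.to_s.delete('.')}@#{domain}"
  end

  # Returns true if the signup is accepted, false if the canonical
  # form has already been registered.
  def register(email)
    !@seen.add?(canonical(email)).nil?
  end
end

registry = CanonicalRegistry.new
puts registry.register("sam.sam@somewhere.com")   # => true
puts registry.register("samsam+11@somewhere.com") # => false
```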

That was my original fix, which I ended up reverting (though it special-cased Google, which in retrospect was not harsh enough).

I feel we should just put this one behind us by adding a new site setting for:

“OMG I am a giant spam vector, put on mega tinfoil mode”

Regarding the bug, stuff can sneak in now easily if you wait to block. It is currently a 100% reactive process.

Meaning this works just fine (feel free to test in console @markersocial ):

./launcher enter app
rails c
ScreenedEmail.block('examplemailaddress@gmail.com')
ScreenedEmail.should_block?('e.x.a.m.pl.e.m.ai.lad.dre.ss@gmail.com')
# true

The problem is:

# 100s of accounts created
ScreenedEmail.block('examplemailaddress@gmail.com')
# 100s of accounts are still there
7 Likes

Oh right the original request which was to block all emails with special characters in them, behind a site setting. I thought I proposed this and you didn’t like it? I can’t remember.

2 Likes

I think this all boils down to @markersocial wanting a feature (forced canonical as I originally implemented) that none of our thousands of hosted customers appear to need.

We can keep refining the reactive feature (search for canonicals when blocking and prompt the admin to delete noise accounts), though I would prefer to hear some repeat complaints first.
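A minimal sketch of what that refinement could look like: given the address being blocked and a list of existing account emails, list the accounts that share its canonical form so an admin can review them. Everything here is hypothetical (`canonical_email`, `accounts_matching_block`, the sample data); it is plain Ruby over an array, not a query against Discourse.

```ruby
# Hypothetical canonical rule: lowercase, drop "+suffix", drop periods.
def canonical_email(email)
  local, domain = email.strip.downcase.split("@", 2)
  "#{local.to_s.split('+', 2).first.to_s.delete('.')}@#{domain}"
end

# Returns the existing emails whose canonical form equals that of the
# address being blocked. These are the candidates to show the admin.
def accounts_matching_block(blocked_email, existing_emails)
  target = canonical_email(blocked_email)
  existing_emails.select { |email| canonical_email(email) == target }
end

existing = [
  "e.x.a.m.pl.e.m.ai.lad.dre.ss@gmail.com",
  "exam.plemailaddress@gmail.com",
  "someone.else@gmail.com"
]

puts accounts_matching_block("examplemailaddress@gmail.com", existing)
```

This only lists candidates; deleting them would remain a manual decision for the admin.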

Regex based blocking will certainly not work for @markersocial but I am happy for him to confirm.

I have no repro of the issue in the OP and strongly suspect the problem accounts were created prior to the block being added.

4 Likes

I can confirm that the original fix worked perfectly and solved this issue with gmails. It would be a real life saver if this optional mode was returned.

Spammers are constantly learning new techniques and are still successfully gaming big players like Facebook, Instagram and Twitter. This makes most other places ‘ez mode’. It’s a full time job for many of them, so it essentially becomes:

If exploitable and (resources required < money earned), then it will be exploited.

They can get around practically any measure; the only hope is to raise the cost of doing so to the point where it is no longer financially rewarding.

Being able to bulk spam with close to unlimited emails/accounts (prior to detection and a mod/admin retroactively blocking their canonical gmail and manually removing their posts) is quite cost efficient. More so if there is not a team of 24/7 moderators.

The cost of getting around anti-spam measures continues to decrease. One example is 4G/5G proxies: for something like $30-$50 per month, people can get access to virtually unlimited real mobile IPs from legitimate ISPs/ASNs. These IPs rotate automatically or manually and are shared simultaneously by entire cities or states of legitimate users.

Blocking these ISPs/ASNs or IPs is not suitable (you can’t just block everyone using Verizon, AT&T etc.). The spammers generally use an IP once and dump it, and blocking individual IPs will also hit legitimate users who happen to be sharing that address. IP blocking is slowly becoming a legacy practice (excluding ASNs of known hosting companies). You can see the tip of the iceberg on these forums:

https://mpsocial.com/c/public-marketplace/61
https://www.blackhatworld.com/forums/proxies-for-sale.112/

I believe the spammers are a mixture of fully or partially hand-rolled bots and manual spam. As Discourse takes more market share, which it clearly is doing at a fantastic rate, I’d be surprised if it doesn’t become a target of commercially available bots.

Whenever Xrumer starts supporting the latest recaptcha version, I’d say most webmasters on legacy forums notice a large uptick in spam due to the rock bottom cost of spamming (no longer need to use a captcha solving API, which are already very cheap per 1k solves):

http://botmasterlabs.net/buy1/

People can already make their own plugins/scripts to support basically any platform using Xrumer. But if they support Discourse out of the box some day:
bad time

I can’t claim to be impartial on this, seeing as I’m in direct need of anti-spam measures. The original post about the gmail dot trick was created by someone else in 2014, and it seems that another user solved this by requiring approval on the first x posts, so maybe there are at least three user reports? :sweat_smile:

Sorry for the tangent, back on track.

Regarding the regex blocking for emails, yes you are correct. It is a partial solution, but not ideal for these reasons:

If blocking all gmails with 1 period or more before @:

  • It will unavoidably block real legitimate gmail users that have either 1 or more periods in their gmail, which is very common.
  • The spammers can still create quite a lot of variations with one period. Gmail has a maximum length of 30 characters, so e.g. 12345678901234567890123456789.0@gmail.com will have 30 usable combinations with a single period.

Blocking all gmails with 2 periods or more before @:

  • Fewer legitimate gmails blocked, but it will still block legit gmail users that have more than 1 period in their email.
  • The spammers can create a lot more variations with a single 30 character gmail. I think ~842 combinations or so.
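For anyone who wants to check the arithmetic, here is a small sketch for counting dot variations. It assumes a period can go in any of the gaps between characters (never first, last, or doubled) and that `length` counts the characters without periods; `binomial`, `dot_variants` and `dot_variants_up_to` are hypothetical helper names.

```ruby
# Number of ways to choose k items from n (exact integer arithmetic).
def binomial(n, k)
  return 0 if k < 0 || k > n
  (1..k).reduce(1) { |acc, i| acc * (n - k + i) / i }
end

# Variations of a local part with exactly `dots` periods:
# choose which of the (length - 1) gaps get a period.
def dot_variants(length, dots)
  binomial(length - 1, dots)
end

# Variations with anywhere from 0 to max_dots periods.
def dot_variants_up_to(length, max_dots)
  (0..max_dots).sum { |k| dot_variants(length, k) }
end

puts dot_variants(30, 1)       # => 29
puts dot_variants(30, 2)       # => 406
puts dot_variants_up_to(30, 2) # => 436
puts 2**29                     # every possible variation of 30 characters
```

Under those assumptions the exact counts come out a little different from the rough figures above, but the point stands: even a low cap on periods leaves hundreds of usable variations per address.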

I can confirm that the new accounts came through after the block was active, as the block created date is Feb 1. I was watching new accounts being created yesterday while seeing both cases of the block rule having new recent matches as well as new registrations coming in using the combinations of the same email (periods only).

I disabled registrations overnight and re-enabled them this morning. They have created 104 new accounts so far today with permutations of that gmail address and are continuing to register more. I can confirm that once the periods are removed from the emails of these accounts, each one is an exact match with the Screened Emails blocked record.

I tried testing the blocks in rails c as described, this is where it gets a bit weird.

So it seems that some records are returning ‘true’ as intended and some are returning ‘false’, even if the email tested is an exact match to the canonical blocked email. For the records that return ‘true’, it worked entirely as intended and returned true for all the variations that I tested. For the records that return ‘false’, all the variations I tested returned false as well.

I was trying to find any correlations. I can confirm these are not correlated (or at least not consistently correlated):

Email length (before @)
Email containing characters and numbers
Matches (amount of times blocked)
Matched date

It does seem like there is a correlation with the block creation date though, with older records being less likely to work (returning false). Records that were created 9d ago returned a mix of true/false, and all records I’ve tested so far that were created more recently than that (1h-8d ago) are returning true.

Could this be related to ‘max age unmatched emails’? I think this option is somewhat new, and I have it set at the default value of 365 days.
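In case it helps with a repro, this is roughly how the results could be tallied by record age. It is only a sketch: `Record` and `results_by_age` are hypothetical, the sample data is made up, and the block passed in stands in for whatever check is being tested (in the rails console that would be the `ScreenedEmail.should_block?` call shown earlier in the thread).

```ruby
# Hypothetical record shape: the blocked email and its age in days.
Record = Struct.new(:email, :created_days_ago)

# Groups records by age and counts how many returned true/false.
# The caller supplies the check as a block.
def results_by_age(records)
  grouped = records.group_by(&:created_days_ago).sort_by { |age, _| age }
  grouped.to_h do |age, group|
    counts = Hash.new(0)
    group.each { |record| counts[yield(record.email)] += 1 }
    [age, counts]
  end
end

records = [
  Record.new("recent@gmail.com", 1),
  Record.new("older@gmail.com", 9),
  Record.new("oldest@gmail.com", 9)
]

# Stand-in check for illustration only.
fake_check = ->(email) { email != "oldest@gmail.com" }

p results_by_age(records) { |email| fake_check.call(email) }
```

A table like this, together with the exact emails that return false, should make the age correlation easier to confirm or rule out.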

1 Like

Well, if you can come up with detailed repro steps for a bug, we’ll definitely fix it.

max age unmatched emails is not a new setting, though – along with max age unmatched ips this is a tool for cleaning up really old entries in the screened IP and Email lists respectively, entries that have not matched anything in a year.

3 Likes

I am going to need exact examples here. If there are bugs then I certainly want to fix them.

2 Likes

I do hear you on this, I think a major objection @codinghorror had about the original implementation was that we were carrying special Google logic. This made Jeff pretty uneasy.

I guess the refinement of “everything is turned canonical, regardless of domain” alleviates this concern a bit.

Eg:

sam+982@sam.com → allowed to register … the first time the canonical sam@sam.com is seen
s.a.m.@sam.com → not allowed to register … the second time, we notice that the canonical sam@sam.com is already registered.

This may return some day; we just need to find this abuse elsewhere first. The last time I investigated, we did not come across this abuse on our hosting.

3 Likes

Thanks @sam @codinghorror :slight_smile:

Only have a little bit of time to post today, but wanted to share some additional information before responding more thoroughly.

I’ve found that deleting a record that is returning false from logs → screened email (allow), then blocking the email again (by delete user + block on the user’s admin page) has made a previously failing rule consistently return true now for the direct match and variations.

This seems to match with the observation of the issue being with older records. Will need to test more.

4 Likes

There’s always the bridgekeeper’s way of (randomly) vetting newbies… :grin:

color

assyria

swallow

2 Likes