Spambots from Tor exit points keep taking over my forum


(Lowell Heddings) #1

Extremely annoying. More than one IP address, they aren’t posting links, just a bunch of nonsense from a ton of different accounts.


(Lowell Heddings) #2

It’s happened enough that the posts are indexed in Google.

It would be really useful to have a way to block keywords or something.


(lid) #3

I am curious: how did they sign up or sign in to your site?

Did they register with an email address, or did they use one of the external login methods (Facebook, Twitter, etc.)?

In general, I would say a good feature request would be a way to add an anti-bot test (CAPTCHA or other) on first signup/sign-in, as the first level of defense.

Probably plugin material.


(Lowell Heddings) #4

I don’t know, I just started using the delete spammer feature. I did notice they were all from @outlook.com though, and all from completely different IP addresses.

Looks like it was happening earlier as well…


(Jeff Atwood) #5

These users are creating multiple new accounts from multiple IP addresses, all different. The normal way to block is to just disallow the IP or IP blocks, but this spammer has so many, and none of them are duplicates.

We’re looking at it now.

  • Per the message stream in HAProxy logs it is definitely all human actions in a live browser.

  • They are all Tor exit points.


(Lowell Heddings) #6

Looks like there’s a way to block Tor exit nodes. Probably a useful thing to have built into the forum.

http://www.mediawiki.org/wiki/Extension:TorBlock


(Jeff Atwood) #7

We can’t do much with IPs here other than some kind of arcane Tor-blocking.

One thing I definitely do see, and have seen for a while across a lot of sites, is patterns in spammer email signup names:

testads5017@outlook.com
testads5016@outlook.com
testads5014@outlook.com

and

rakhisai23@hotmail.com
rakhisai24@hotmail.com
rakhisai25@hotmail.com

We already check previously “burned” spammer email addresses before allowing accounts to sign up. I suggest we improve that a bit, @zogstrip.

During signup, at the place where we check for previously “burned” spammer email addresses and IPs, add this check:

  1. Query the last ~100 burned spammer email addresses, ordered by burn date. (We have to cap this: if the spam list gets huge, it might start rejecting real emails.)

  2. Loop through each burned spammer email address.

  3. Compute the Levenshtein distance between the email of the user signing up and the current burned spammer email.

     levenshtein_distance('fire', 'water') = 4
    
  4. If it returns <= 2, that is, the current email address differs from a known spammer email by 2 characters or fewer, then reject the account as if it directly matched a spammer email.

I believe this will resolve the majority of the egregious stuff. Go ahead and make this a site setting, levenshtein_distance_spammer_emails and default it to 2.
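A minimal Python sketch of the check described in the steps above, not the actual Discourse code. The function names, the lowercase normalization, and the assumption that `burned_emails` arrives newest first are all illustrative; only the threshold of 2 and the cap of roughly 100 addresses come from the steps above.

```python
from typing import Iterable


def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def looks_like_burned_email(candidate: str,
                            burned_emails: Iterable[str],
                            max_distance: int = 2,
                            max_checked: int = 100) -> bool:
    """Return True if candidate is within max_distance edits of a burned email.

    Assumption: burned_emails is ordered newest first, so only the most
    recent max_checked entries are compared.
    """
    candidate = candidate.lower()  # assumption: compare case-insensitively
    for index, burned in enumerate(burned_emails):
        if index >= max_checked:
            break
        if levenshtein_distance(candidate, burned.lower()) <= max_distance:
            return True
    return False
```

With the examples from this post, `looks_like_burned_email("testads5015@outlook.com", ["testads5017@outlook.com"])` is true, because the two addresses differ by a single character.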

At minimum, it forces spammers to use much more unique emails … and since they are already using “perfectly” unique IPs in the form of Tor exit points…


(Lowell Heddings) #8

Well, it’s a start. Personally I think a content filter of some type would be much more effective.

As in, if a user signs up and posts 5 topics or 5 posts, and they are all very similar… and also very similar to recent spam, then they should be flagged.


For the moment, I set the option for “admins must approve all new user signups” because I need to get some sleep and I don’t feel like dealing with spam.

(Not like we have a ton of usage from new users anyway)


(Silver Quettier) #9

Perhaps it would be better to check against the ratio (levenshtein distance / email address length)?
Although short email addresses are not in fashion, it could help some cases, particularly concerning short emails and “legit” patterns, such as first name or last name replaced by an initial.

t.stark@gmail.com and n.stark@gmail.com are two unrelated people (they don’t even live in the same universe!)
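A rough sketch of what that ratio could look like in Python. Dividing by the length of the longer address is an assumption (the suggestion above does not say which length to use), and the function names are made up for illustration.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                 # deletion
                current[j - 1] + 1,              # insertion
                previous[j - 1] + (ch_a != ch_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein_distance(a, b) / longest if longest else 0.0
```

A one-character difference weighs more in a short address than in a long one: 1/17 for the two 17-character Stark addresses versus 1/23 for a pair like testads5017@outlook.com and testads5016@outlook.com. Any cutoff on this ratio would still have to be tuned against real signups.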


(Jeff Atwood) #10

Maybe – but most spammers have emails from “official” mail providers like yahoo, outlook, gmail and they aren’t exactly going to get joe@gmail.com. Spammers tend to have rather byzantine, crazy email addresses, nothing short at all. Random sample:

hoangkhanhthu123@gmail.com
arhamanarif@gmail.com
yaritza1095@gmail.com
zeu.thunder@gmail.com
erikayulianti62@gmail.com
jessise13@gmail.com
info@technogala.com
donggialuxury@gmail.com

And if the email comes from a unique domain like yo@spamtownusa.com then that is fine, plenty of levenshtein distance to work with based on the domain piece.


(James Sanderson) #11

I saw a very similar situation on an IRC network a few years back, dealing with spammers flooding them from Tor IP addresses using randomly generated details. What worked in the end was Bayesian filtering, just like you would use for email spam. IP is Tor? Username looks randomly generated? Email is in the form 7 letters 4 numbers? If you figure out the right features to extract you can then train on the probability those features indicate a spammer, and create something adaptive (so if they change strategy you can keep up) and intelligent enough to factor in information like “they are on Tor” without blindly blocking all Tor users.

Another nice thing about probability-based filtering is it’s not binary, so you can create a better user experience. Low probability = post away, really high = instaban, but for people in the middle you could have a sliding scale - flag account for manual review, CAPTCHAs before posting, things like that.
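To make the idea concrete, here is a small Bernoulli naive Bayes sketch in Python. It is only an illustration under stated assumptions: the two features, the class and function names, and the 0.5 / 0.95 thresholds are invented for the example, and whether an IP is a Tor exit is assumed to be known by the caller.

```python
import math
import re
from collections import Counter
from typing import Set


def extract_features(email: str, is_tor: bool) -> Set[str]:
    """Turn a signup into a set of binary features (illustrative choices)."""
    local_part = email.split("@", 1)[0].lower()
    features: Set[str] = set()
    if is_tor:
        features.add("ip_is_tor")
    if re.fullmatch(r"[a-z]+\d{2,}", local_part):
        features.add("letters_then_digits")
    return features


class SignupFilter:
    """Bernoulli naive Bayes with add-one smoothing over binary features."""

    LABELS = ("spam", "ham")

    def __init__(self) -> None:
        self.class_counts: Counter = Counter()
        self.feature_counts = {label: Counter() for label in self.LABELS}
        self.vocabulary: Set[str] = set()

    def train(self, features: Set[str], label: str) -> None:
        self.class_counts[label] += 1
        self.feature_counts[label].update(features)
        self.vocabulary |= features

    def spam_probability(self, features: Set[str]) -> float:
        total = sum(self.class_counts.values())
        log_scores = {}
        for label in self.LABELS:
            n = self.class_counts[label]
            score = math.log((n + 1) / (total + 2))  # smoothed class prior
            for feature in self.vocabulary:
                p = (self.feature_counts[label][feature] + 1) / (n + 2)
                score += math.log(p if feature in features else 1 - p)
            log_scores[label] = score
        top = max(log_scores.values())  # subtract the max for numeric stability
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


def action_for(probability: float,
               review_at: float = 0.5,
               ban_at: float = 0.95) -> str:
    """Map a spam probability onto a sliding scale of responses."""
    if probability >= ban_at:
        return "ban"
    if probability >= review_at:
        return "review"  # e.g. manual review or a CAPTCHA before posting
    return "allow"
```

Because Tor is just one feature among several, a Tor user with an ordinary-looking address only gets pushed toward "review" if the training data says that combination is usually spam.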


(cpradio) #12

Plus, the chances that t.stark@gmail.com is on the spammer burn list would be low if he was indeed a good user. So that address would not even be in the list that new signups are tested against.


(Tuan Anh Tran) #13

Saw a commit just now that pushes a fix for this. If the email is similar to a known spambot email, it will be rejected.

This would temporarily fix the issue for @geek, but may cause some trouble if spammers use a popular phrase in their email address.


(Michael Downey) #14

We got tons of spam accounts registering from @outlook.com addresses in our directory system as well. We ended up blocking signups from IPs listed in the SBL blacklist from Spamhaus and the SpamCop blacklist, which are generally well-maintained. (Although we occasionally have people blocked inadvertently and have to maintain a whitelist now.)
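For anyone who has not used a DNS blocklist before, a lookup is just a DNS query: reverse the IPv4 octets, append the list's zone, and see whether a record comes back. A minimal Python sketch, where the function names are made up and the zone is passed in by the caller (check each list's own documentation for its zone name, return codes, and usage policy):

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name used to look up an IPv4 address in a DNSBL zone."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError if not valid IPv4
    reversed_octets = ".".join(reversed(str(address).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone answers with an A record for this address.

    Simplification: a failed lookup is treated the same as "not listed",
    and any answer is treated as "listed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

For example, `dnsbl_query_name("192.0.2.99", "dnsbl.example")` gives `99.2.0.192.dnsbl.example`. A whitelist check, like the one mentioned above, would go in front of `is_listed`.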

Since we don’t do Discourse-specific signups this isn’t really an issue for our installation, but it might be something that “someone” may want to think about integrating in some way. :thumbsup:

(Gabriel Mazetto) #15

I like the idea of a Bayesian filter. We can always plug in “Akismet” or “Defensio” as a way to help flag spam posts and flag out spammers.

There are a lot of interesting plugins for WordPress that we can analyse to extract some defense techniques:

SixApart also had, for some time, a free and open Akismet alternative, for which they released the source. Since the service was retired in March this year, it was difficult to find the source again, but I managed to fork someone’s fork, and here is the link: GitHub - brodock/AntiSpam: Fork of the AntiSpam framework from Sixapart. It may also be a good starting point for figuring out techniques.

Blogspam is also an alternative: http://blogspam.net/

And finally Sblam, which does the same thing: GitHub - kornelski/Sblam: Server-side HTTP spam filter


(Jeff Atwood) #16

The more “generous” matching against spammer addresses is now in thanks to @zogstrip – so any new user signup emails within 2 characters of the last 100 known spammer emails will be rejected as matching.

That will cover the specific case @howtogeek was seeing.

Beyond that, there are a few basic ways to think of this kind of 100% manual human entered spam:

  1. How are they signing up?
  2. What URLs are they posting?
  3. What IPs are they coming from?
  4. What other content are they posting?
  5. How are they posting it?

For now we opted for an improvement in #1 – better detection of near-duplicate spam email patterns at signup, e.g. spammer12@gmail.com versus spammer13@gmail.com. Remember these are validated emails! There is a human being monitoring every email there, has to be for the signup to work.

In the case of #2 and #3, we’re stuck – they are not posting any URLs, and they are coming from randomized Tor exit points. Very difficult to use either one of those. Which is too bad because IP and URL are both constrained namespaces that make it a bit easier to block with.

(You could certainly block all Tor exit points, or all new users coming from Tor exit points… but that might be bad for, say, a Discourse used by Iranian dissidents.)

The eventual next step would be to get better about retaining spam content and trying to identify patterns in it, e.g. if you look at the image of the spam you’ll definitely see some repeated phrases – the best one is probably the telephone number.


(cpradio) #17

That isn’t true; it is trivial to write a POP3 mailbox reader and tell it to process any links using wget or an HTTP request. That would easily remove the human interaction, but nonetheless, I do agree with the steps taken and I agree with your post as a whole (just not that one statement ;)).


(Jeff Atwood) #18

That’s an interesting point, maybe we should do a JavaScript check on that particular page to make sure the link is really validated, not just http-retrieved via wget.


(Kevin P. Fleming) #19

In the true arms race spirit, you could make the validation page include a Captcha :slight_smile:


(lid) #20

I am not sure the spam @geek got is purely human interaction (checking other Google results with this pattern, there are also pictures, so it is possible that an automation tool was used and it failed to insert the images).
If the spammer is a human, then it looks like we all know the outcome of this war: nobody wins!

Bottom line:
If someone is motivated enough to post a message, that message will get there.

  • You block an email pattern: the spammer will start using more legitimate-looking names.

  • You block IPs: the spammer will use either Tor or, in the worst case, compromised computers of legitimate users.

  • You block phrases and keywords: the spammer will use images.

So if there is a human element in this spam attack, CAPTCHA is out, and so is basically any anti-bot test.

In @geek’s case (based on the screenshot he provided), it looks like the spam is arriving in batches of 5 topics in a minute. To me this looks like a detectable pattern.

What have we got?
A new user with no reputation, who has just joined and is already opening multiple topics. Behavior analysis would be more difficult if the user replied to existing topics rather than opening new ones, as multiple replies can be considered normal behavior for a new user.

There are two methods I can think of that can detect this kind of attack, based on these signals:

  1. We know the user is new.
  2. The user opens one or more topics in a short time.
  3. All the topics share an identifier (a phone number).

Possible solution
This is based on short bursts of spam from new users with no reputation.
Even if the spammer uses 10 different accounts to send 10 different posts, the system can analyze recent posts from new users and run a basic algorithm to detect whether the posts share uncommon phrases.
For example, if the posts have the word “hi”, then that is a common phrase (we move on), but if 5 posts share “00971558709955”, a long numeric or phone number pattern, then this is not a common phrase. Or, if the forum is English-speaking and you get the same non-English word in several posts, you can make a rule that non-English phrases are not common phrases.
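A rough Python sketch of the shared-phrase idea, limited to the long-number case from the example. Everything here is an assumption for illustration: the function name, the 8-digit minimum, and the rule that at least 3 distinct accounts must share the number.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Assumption: a run of 8 or more digits is "uncommon" enough to track.
LONG_NUMBER = re.compile(r"\d{8,}")


def suspicious_shared_numbers(posts: Iterable[Tuple[str, str]],
                              min_accounts: int = 3) -> Dict[str, Set[str]]:
    """Find long digit runs that several new accounts have all posted.

    posts is an iterable of (username, post_text) pairs taken from recent
    posts by new users. The result maps each shared number to the set of
    accounts that posted it.
    """
    accounts_by_number: Dict[str, Set[str]] = defaultdict(set)
    for username, text in posts:
        for number in set(LONG_NUMBER.findall(text)):
            accounts_by_number[number].add(username)
    return {number: accounts
            for number, accounts in accounts_by_number.items()
            if len(accounts) >= min_accounts}
```

A common word like “hi” never enters the table, so it cannot trigger a match, while a number repeated across accounts shows up together with the accounts to review. The non-English-word rule would need a separate language check and is not covered here.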

If it is one user who passes a threshold of topics per time frame, we can flag that user as a spammer and have their posts go to review mode.
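The per-user threshold could be a simple sliding window. A minimal Python sketch, assuming topics are recorded in chronological order; the class name and the defaults (5 topics within one minute, taken from the batches in the screenshot) are only illustrative.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Deque, Dict


class TopicBurstDetector:
    """Flag users who open too many topics inside a sliding time window."""

    def __init__(self,
                 max_topics: int = 5,
                 window: timedelta = timedelta(minutes=1)) -> None:
        self.max_topics = max_topics
        self.window = window
        self._recent: Dict[str, Deque[datetime]] = {}

    def record_topic(self, username: str, created_at: datetime) -> bool:
        """Record a new topic; return True if the user should go to review.

        Assumption: calls for a given user arrive in chronological order.
        """
        timestamps = self._recent.setdefault(username, deque())
        timestamps.append(created_at)
        # Drop topics that have fallen out of the window.
        while created_at - timestamps[0] > self.window:
            timestamps.popleft()
        return len(timestamps) >= self.max_topics
```

In a real forum this would only be applied to new users with no reputation, since established users may legitimately post in bursts.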

Another good de-motivator is an option to make posts from new users with no reputation invisible to unregistered users, so that, for example, search engines will not see or index those messages. That removes the incentive for spammers posting for SEO purposes. (The problem is that the system might have to manage multiple caches for the same request: one for registered users and one for unregistered users.)

BTW,
Detection of automation tools and client-side validation are becoming more difficult thanks to projects like
http://phantomjs.org/