It’s now available
A few more improvements in this area:
- @eviltrout just added a “Suspect” tab on admin, users.
- @sam added the ability to click on the avatar in these lists to quickly view the user card for that user (and the profile page).
- @zogstrip made it so that TL0 profile customizations don’t show up for anon users. They only show up when you reach TL1.
We’ll add some more checks later, but for right now, “Suspect” means users with
- 1 or fewer topics viewed
- 1 or fewer posts read
- a filled-out “about me” field on their user profile
This is by far the most predictive set of data I’ve seen on users that marks them as profile spammers. And FYI if you see a number at the end of their username, or their username is a random string of chars… that’s also highly predictive.
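To make the criteria above concrete, here is a minimal Python sketch of the check. It is not Discourse's actual implementation: the `UserStats` type, its field names, and both function names are assumptions made up for illustration. The "random string of chars" signal is left out because there is no simple, reliable test for it.

```python
import re
from dataclasses import dataclass


@dataclass
class UserStats:
    # Hypothetical record; field names are illustrative, not Discourse's schema.
    username: str
    topics_viewed: int
    posts_read: int
    about_me: str


def is_suspect(user: UserStats) -> bool:
    """True when all three 'Suspect' criteria from the post hold."""
    return (
        user.topics_viewed <= 1
        and user.posts_read <= 1
        and bool(user.about_me.strip())  # "about me" has been filled out
    )


def username_ends_in_digit(username: str) -> bool:
    """Secondary signal mentioned in the post: a number at the end of the username."""
    return re.search(r"\d$", username) is not None
```

A user with 0 topics viewed, 0 posts read, and a link-stuffed bio would be flagged; a user who has read a dozen topics would not, whatever their bio says.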
The solution we came up with is that TL0 profile modifications are no longer visible to anonymous users. They were already invisible to Google, of course, due to our default robots.txt.
I appreciate your feedback on this, and I must say, profile spamming was a huge problem on at least one of our partner sites and I personally cleaned most of it up over the last week, so I feel your pain.
We’ve made a ton of improvements in this area as a result. We hate spam, and we want all Discourse instances to be safe out of the box with default settings.
I personally deleted thousands of these profile spammer accounts (to understand better what we’re dealing with) and what you describe happened maybe 5 times out of those thousands. If that.
This is exceedingly rare. For what it’s worth, this is what they look like:
For all that “sophistication” they ran these accounts from the same IP so they were easy to find as 4+ dupe accounts. But yeah, if they ran these accounts from unique IPs, they might have gotten away with it.
But this is so, so very rare based on the samples I have. Your average profile spammer is dumb as a box of rocks.
The crawler can’t even find the profile because there is no link pointing to the user profile page in any topic.
Not within the forum perhaps, but elsewhere?
It’s also disallowed in robots.txt.
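For anyone who wants to check what a well-behaved crawler is allowed to fetch, here is a small sketch using Python's standard `urllib.robotparser`. The rules in `ROBOTS_TXT` are an assumption for illustration (a blanket `Disallow: /users/`), not a copy of the robots.txt Discourse ships, and the paths are made up.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules for illustration only; not copied from Discourse's default file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /users/
"""


def can_crawl(path: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the assumed rules allow user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)
```

Under these assumed rules, `can_crawl("/users/spammer123")` is `False` while a topic path such as `/t/some-topic/42` is allowed. Note that robots.txt only restrains crawlers that choose to honor it.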
I could see that happening nicely, maybe with an email reminding them to come back, and making users who haven’t come back in a while fill out a captcha.
Also no content in the profile, except the username.
Are you sure?
JavaScript disabled
UA Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Well, you are right. I checked the code. The bio would show if the user’s trust level is above 0.
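Putting the rule from this thread into code, as a rough sketch only: the function and parameter names are hypothetical, and this is not the actual Discourse code, just the logic as described here (TL1 and above show their bio to everyone; TL0 bios are hidden from anonymous visitors).

```python
def bio_visible(owner_trust_level: int, viewer_is_logged_in: bool) -> bool:
    """Decide whether a profile bio is shown, per the rule described in this topic."""
    if owner_trust_level > 0:
        # TL1 and above: visible to everyone, including anonymous visitors.
        return True
    # TL0: only registered (logged-in) users, which includes the owner, see it.
    return viewer_is_logged_in
```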
I recently spent about a week cleaning up a ton of spam on a forum I purchased. This was from SMF software, not Discourse, but I think the patterns might be useful for building spam heuristics.
Telltale signs of spammer:
- They use a lot of '.'s in a gmail address. Basically they can have one spam gmail address and register 50 accounts by randomly sprinkling periods throughout. This is a little tricky because legit users sometimes do the same thing with susie.forumname@gmail.com
So I ran a regex to remove all periods and anything after the “+” sign, then did a GROUP BY on email address, and anyone with more than three accounts tied to the same gmail address got nuked. This is a gmail-specific tactic, but it wiped out 10K spammers for me. This particular forum allows users to have multiple accounts, so I allowed users who had two accounts tied to the same email to live.
- Users linking to the same url. Required manual inspection again, but there were 4 urls out of 100 that were legit. The other 96 urls were clearly spammers who’d created multiple spam accounts linking to these sites.
- Anyone with a url in their description/about me/homepage/any text box was worth doublechecking.
- Spammers often copy/paste “location”… found a lot of accounts sharing the same typos even. Example “location field = ‘United STates’” A simple SQL group by on location flushed a lot of these out, but I still had to manually review just in case.
For all these tests, I doublechecked that none of the accounts had any PMs or posts. If they did, I manually reviewed the account to verify my filter wasn’t catching legitimate users.
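The gmail trick in the first bullet can be sketched roughly like this in Python. This is a reconstruction under assumptions, not the regex and SQL I actually ran: the function names and the `(username, email)` input shape are made up for illustration, and the "more than three accounts" threshold is the one described above.

```python
from collections import defaultdict


def normalize_gmail(email: str) -> str:
    """Collapse gmail dot and '+' variants; leave other addresses alone (lowercased)."""
    email = email.strip().lower()
    if "@" not in email:
        return email
    local, domain = email.rsplit("@", 1)
    if domain == "gmail.com":
        # Drop anything after "+" and remove all periods from the local part.
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"


def accounts_to_review(accounts, threshold=3):
    """Group (username, email) pairs by normalized email.

    Returns only groups with MORE than `threshold` accounts, mirroring the
    "more than three accounts" cut-off in the post.
    """
    groups = defaultdict(list)
    for username, email in accounts:
        groups[normalize_gmail(email)].append(username)
    return {email: names for email, names in groups.items() if len(names) > threshold}
```

The output is a list of candidates, not a kill list: as noted above, anything with PMs or posts still deserves a manual look before deletion.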
Other random comments:
I do agree with @codinghorror above that I’d rather give spammers a chance to expose themselves, which means letting them enter profile info.
Google does crawl plain text urls, even non-hyperlinked ones. Excluding through robots.txt makes a lot of sense.
I would never delete inactive users unless they were verifiably spammers. I can personally attest to being one of those users who’s signed up for a forum account, never posted, and then returned years later and started using the account.
Similarly, most forum owners I know would rather have more accounts than fewer and so would never delete inactive users. A potential forum buyer will generally pay more for more accounts even if they’re inactive. A sophisticated buyer will ask you to run a few SQL queries to see how many users are actually active, but most won’t.
Very helpful, thanks for sharing this.
I would rather educate people about this than mindlessly propagate bad ideas.
Why? Why not create the account when you actually need it and will use it? What value is there in an empty, never used account?
If the account is deleted, there is no chance of ever reactivating that user. If the account remains around, there is a small, but still non-zero chance of reactivating them.
I, and most other forum owners I know who run forums semi-professionally, do not want inactive accounts deleted unless they’re spammers. This is not pulling opinion out of thin air: this very topic has been discussed several times on a private forum I belong to for folks who own forums with >2M posts. Each time, folks are like “Well, I’d like to delete these users since 99.9% of them won’t ever return, but I’m not going to touch it because of the 0.1% who might return. And if there’s any chance a buyer will pay more for more users, then also not something I want to touch.”
I do challenge the statement that this is “mindlessly propagating bad ideas”. Most forums these days are sold on a multiple of revenue, but a buyer will still look at the user count and think “I’m willing to pay a little more because I can reactivate some of these users with better email campaigns etc and increase activity.” I’ve been successful in doing this myself, so while I ask a seller for a breakdown on active vs inactive users, I am willing to pay a small amount more if there are a bunch of inactive users that I might be able to activate. The more important thing to me is how many of the inactive users have a bounced email address; those are the accounts that are worthless.
This is configurable, so really this is just a question of defaults.
I would argue that, in general, the cost of having a few hundred “prime” usernames parked by dormant never used accounts, with a pretty significant risk of being “spam” bombs activating 1 year in, is worse than the impact of deleting these no-op accounts.
So it really is just a question of which default is saner.
At the very least you would need to determine how many of those 3, 4, 5 year old accounts have valid emails if you are “buying” a bunch of totally inactive signups. We only validate email at signup and after a year of not seeing that user we stop mailing them digests.
Note that at no point does Discourse delete users, nor is there any code to do so in the current code base, provided they validated their email address. Users are given 7 days (default value) to validate email after signup.
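As a rough illustration of that rule (not the actual Discourse code; the function name and the `grace_days` parameter are hypothetical, with 7 taken from the default mentioned above):

```python
from datetime import datetime, timedelta


def should_purge(signed_up_at: datetime, email_validated: bool,
                 now: datetime, grace_days: int = 7) -> bool:
    """Only accounts that never validated their email are candidates for removal."""
    if email_validated:
        # Validated accounts are never deleted, no matter how inactive they are.
        return False
    return now - signed_up_at > timedelta(days=grace_days)
```

So a validated account that has been dormant for five years stays put, while an unvalidated signup from eight days ago would be cleaned up.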
(Also if I was selling a forum, and those are the ground rules, I would hire some spammer to create tens of thousands of new accounts with unique email addresses, thus boosting the sale price.)
I misunderstood then; I’m sorry.
I’d thought from what was written previously we were discussing users who had validated their email but never actually posted.
My apologies.
As long as there’s a setting to turn off this auto-deletion (should it ever make it into the codebase) I’m happy.
For users who haven’t validated their email after signup, I’d probably send them a followup email or two as a “reminder to activate your account”, but that’s just me and I certainly understand a sane default is to delete them instead.
Re: Spam bombs–the ones I’ve observed had fully activated accounts. I don’t have a site currently on Discourse, so this may be different for some reason, but on other forum software they make sure to fully activate the account and then go dormant until they think they’ll be off the radar.
Re: hiring a spammer… that’s just one of many risks associated with buying a forum. Bigger risk is whether mods/users will stick around through the change of ownership or head elsewhere.
Since I created this image for @cpradio it also belongs here:
Basically, a TL0 profile is only visible to that user and other registered users, because
I still think it would be better if they just couldn’t post a profile at TL0, as being able to do so will just encourage them to do so, and presumably those who get to TL1 will have their profile visible.
No. As remarked several times earlier in this topic, you want them to expose themselves, not lie in wait for a year before unleashing a payload. In Discourse, they have to actually do the work and read the site to get out of the sandbox; simply existing does not get anyone out of the new user sandbox, ever.