Human-driven copy-paste spam

jsha · September 14, 2018, 6:36pm

Hi all! I run the forum at https://community.letsencrypt.org/, and have found Discourse to be a huge win over the years. Our forum is so critical to our success.

However, we’ve noticed a new pattern of spam. People will sign up and post comments that seem really legit, then hours later they will edit those comments to add links to sites they are trying to promote. When we dig into these, it turns out the “legit” comments were copy-pasted from other threads, or from our GitHub issues page. If you’re Discourse staff, you can read our thread about it (note: It’s in our Lounge, so others won’t be able to see it). As far as I can tell, this form of spam appears to be driven by real humans.

I’m curious, has anyone else been seeing this form of spam? Any ideas about countermeasures, other than flagging? One member of my forum suggested it would be helpful to have a feed of posts that were edited after posting, or particularly posts that were edited to add links.
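
To make the "edited to add links" idea concrete, here is a minimal sketch of how such a feed could decide whether an edit introduced new links. It assumes you have the raw text of a post before and after the edit; the function names and the URL regex are illustrative and not part of Discourse.

```python
import re

# Deliberately simple URL matcher (an assumption for this sketch);
# a real forum would more likely inspect the rendered links of a post.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


def extract_links(text: str) -> set[str]:
    """Return the set of http(s) URLs found in a post body."""
    return {match.rstrip(".,;:!?") for match in URL_PATTERN.findall(text)}


def links_added_by_edit(before: str, after: str) -> set[str]:
    """Return URLs that are present after the edit but were absent before it."""
    return extract_links(after) - extract_links(before)


original = "Try renewing with certbot renew --dry-run first."
edited = original + " More tips at https://spam.example/cheap-pills."
print(links_added_by_edit(original, edited))
```

A feed built on this would list only edits where the returned set is non-empty, which keeps ordinary typo fixes out of the reviewers' way.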

pfaffman · September 14, 2018, 7:06pm

I think what you should do is change min trust to edit post. The default is zero, so you might bump it up to 1.

codinghorror · September 14, 2018, 7:31pm

This is indeed something to keep an eye out for. Bear in mind we made a few changes to address this, so here’s what I recommend in order of priority:

thanks to @sam, edits of more than {x} characters to a post now always force a full post revision, so there is no normal grace period for editing when the edit is particularly large. The site setting is editing grace period max diff and it defaults to 100 characters. You can try setting this lower.
you can reduce the total amount of time posts are eligible for editing by users: post edit time limit defaults to 86400 minutes, which is 60 days. If the users are coming back hours later, you could reduce this to, say, 30 minutes.
you can make the editing grace period setting shorter, it defaults to 300 seconds or 5 minutes.

As a very last resort, you can also disallow lower trust levels from editing their posts altogether as @pfaffman indicated, but I’d adjust the other settings first and only resort to that if there are continued problems.
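
As a rough illustration of how these three settings interact, the sketch below models them in Python. It is a simplified model under stated assumptions, not Discourse's actual logic: the class, the attribute names and `classify_edit` are hypothetical, and the post's age stands in for whatever timestamp the real grace period is measured from. Only the default values come from the settings described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditSettings:
    # Defaults as quoted in the thread; attribute names are illustrative.
    editing_grace_period_seconds: int = 300    # 5 minutes
    editing_grace_period_max_diff: int = 100   # characters
    post_edit_time_limit_minutes: int = 86400  # 60 days


def classify_edit(settings, post_age_seconds, chars_changed):
    """Simplified model: return 'blocked', 'new_revision' or 'grace_edit'."""
    # Past the edit time limit, the user cannot edit at all.
    if post_age_seconds > settings.post_edit_time_limit_minutes * 60:
        return "blocked"
    # Large edits always create a visible revision, even inside the grace period.
    if chars_changed > settings.editing_grace_period_max_diff:
        return "new_revision"
    # Small edits outside the grace period also create a revision.
    if post_age_seconds > settings.editing_grace_period_seconds:
        return "new_revision"
    return "grace_edit"


three_hours = 3 * 3600
print(classify_edit(EditSettings(), three_hours, 80))
print(classify_edit(EditSettings(post_edit_time_limit_minutes=30), three_hours, 80))
```

With the defaults, an 80-character edit made three hours after posting is allowed and recorded as a revision; with the time limit lowered to 30 minutes, the same edit is refused.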

mpalmer · September 15, 2018, 12:05am

Hmm… I wonder if something analogous to Stack Exchange’s “Review Queues” concept might be worth considering. It seems to do an OK job of keeping the worst of the cruft tidied up, and gives high-trust users another way to proactively contribute. Sure, you can do the same job by reading everything, but on high-traffic sites (like I imagine LE probably is these days) reading everything is impractical. Making it easier to surface suspicious stuff, insofar as it can’t be reliably detected algorithmically, seems like a win.

codinghorror · September 15, 2018, 12:27am

That only makes sense on a site where everything is a wiki all the time, though. Definitely not the case in Discourse.

mpalmer · September 15, 2018, 9:08am

I’m not sure why you’re thinking that. Perhaps what we’re thinking of is very different. I’ll describe what’s in my mind grapes.

The sort of review queue-like system I’m thinking of would have criteria like “posts by TL0”, “self-edits by TL0/1”, and other similarly potentially-suspicious things like that. All actions that match those criteria go into the review queue(s), and everyone at TL3 or above (say) gets a “Review!” button up the top when there’s something in the queue. The actual review process could look a lot like the SE one – show post, ask “good, bad, or otherwise?”, and users can do all the things they could do on the post if they came across it organically.

Now, this is certainly not necessary for every Discourse out there. If there’s small enough volume of posts, edits, etc, it’s just unnecessary waffle. If a site is low-volume enough that some regulars (probably site admins, even) are reading pretty much every topic as it hits the top of Latest, then it’s completely redundant, because everything is already undergoing review. Low-volume sites can also use the other knobs available, like “review first post”, winding down the edit thresholds, things like that.

The whole thing is quite possibly niche enough that it doesn’t even make sense for it to be in core. However, for sites like Let’s Encrypt, where you’ve got a large volume of posts (and edits), a large group of reasonably engaged regulars, but nobody on the mod/admin side who is so deeply engaged as to want to read every post/edit, I think it would be valuable to have in the arsenal.

Anyone out there feeling like a spot of plugin development? Design’s all done!

codinghorror · September 15, 2018, 10:54am

It would be equivalent if every single Discourse post was wiki by default and editable by anonymous users; that is the correct conceptual equivalent of Stack Exchange.

Pretty far cry from the defaults in Discourse, so I don’t see any equivalence.

mpalmer · September 15, 2018, 11:55am

I’m not saying the underlying environments are equivalent, merely that the solution to this problem in Discourse could be solved by borrowing an element from Stack Exchange.

JagWaugh · September 15, 2018, 12:42pm

Our site is just above the limit of what a single mod can expect to review in a day and still have some semblance of a life.

We don’t have a spam problem, and our meatbags are mostly well behaved.

I have a couple of times had users who didn’t like being asked (by PM) to moderate their language or unacceptable responses to other users, who then went on a deletion spree, trying to remove all their posts (containing useful information) rather than simply toning the individual post I raised an issue with down. Presumably they felt that if they aren’t allowed to be insulting, then they’ll erase everything they’ve written, then unsub.

Some metric/flag/notification which alerted me to high rates of deletion/edits wouldn’t be a bad idea, but probably difficult to model.

jsha · September 15, 2018, 9:03pm

Thanks so much for the input! The Let’s Encrypt forum is indeed high-volume, but we’re lucky enough to have dedicated volunteers that see just about every post, and are good about flagging the spam. So this may be “working as intended;” I just wanted to see if there are any best practices as we see the problem grow.

Also, one point I’m not clear on: If someone edits their post, does it pop up to the top of the topic list as if there were a new post? If so, that increases the likelihood of getting spotted. If not, I could see a lot of these spammers going unnoticed.

Mittineague · September 15, 2018, 9:10pm

Because only edits to the last post of a topic “bump” the topic, I think one of the best things to do is tweak the edit time window down to something that would allow editing of “recent” posts but not “old” posts.

True, for what would be a legit edit eg. “I forgot to mention —” it would mean another new post would be needed. But it does make spotting problem posts easier.

YMMV, but 5 hours seems like a fair arbitrary time limit IMHO

JagWaugh · September 16, 2018, 4:59am

How about adding a metric raising a flag/notifying moderators when a user has edited or deleted more than N posts in a specific time period?

This would work for the problem I mentioned above, but not for @jsha if his users are only changing individual posts.
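
A minimal sketch of that metric, assuming each edit or deletion can be fed to a monitor as a (user, timestamp) event in chronological order. The class name, its parameters and the thresholds in the usage lines are all hypothetical; nothing here is an existing Discourse feature.

```python
from collections import defaultdict, deque


class EditRateMonitor:
    """Flag users who edit or delete more than N posts within a sliding window."""

    def __init__(self, max_actions, window_seconds):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # user_id -> recent event timestamps

    def record(self, user_id, timestamp):
        """Record one edit or deletion; return True if the user should be flagged.

        Timestamps are seconds and must be non-decreasing per user.
        """
        events = self._events[user_id]
        events.append(timestamp)
        # Drop events that have fallen out of the window.
        while events and events[0] <= timestamp - self.window_seconds:
            events.popleft()
        return len(events) > self.max_actions


monitor = EditRateMonitor(max_actions=3, window_seconds=3600)
results = [monitor.record("user_a", t) for t in (0, 60, 120, 180)]
print(results)  # [False, False, False, True]
```

The fourth action inside the hour exceeds the limit of three and raises the flag, which is the deletion-spree case; a single delayed spam edit would never trip it.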

riking · September 16, 2018, 4:51pm

Yes, one of the major parts of the SE review queues is “improve post by editing it”, but that’s not really a blocker when what you’re concerned about is spam - you really just need a “seems good” / “flag” / "escalate"¹ decision, and limit what kinds of things get in the queue in the first place - you’re looking for “spammer”, not “low quality post”.

¹: “escalate” is referring to “I need a second on this” / “I don’t feel comfortable making this decision on my own”. One implementation would be to open the composer to a new #lounge topic linking to the post under review, and have the reviewer write in their concerns. “Is this a new wave of spam?”

edit: hey actually that ^ seems like a good button to put on the admin flag queue page, targeting the #staff category instead…

RoryBlyth · September 16, 2018, 8:02pm

Hi, all!

I’m new here, and I’d like to throw some ideas in here, but the ideas are a little… unsophisticated.

They’re hacky, but maybe they’d help brainstorm ways of buying you time to come up with a tight, long-term solution.

Is that the kind of thing I should post here? Would it be welcome/of interest? I don’t want to accidentally make a mess with my hacky ideas.

sam · September 16, 2018, 10:25pm

Feel free to post your ideas

Personally I am with @codinghorror here: just stop allowing people to edit posts after, say, 10 minutes and see how it impacts dynamics on the forum. I think only a very tiny percentage of legit users will be impacted, and spam bombs like this will no longer be possible.

One change I totally support here is a site setting for “disallow non bumping edits”, that way nobody can edit post #3 in a 6 post topic, but the edit window can be left as is and the last post in a topic can be edited for a while.

codinghorror · September 17, 2018, 10:05am

I don’t think I’d support that, it is a very “inside baseball” tweaky kind of setting. Like you said, 99% of this is simply reducing the valid edit windows.

jsha · September 17, 2018, 5:01pm

Thanks very much for all the feedback! I’m going to try @codinghorror’s advice to change the grace period settings.

edits of more than {x} characters to a post now always force a full post revision

What’s the advantage of forcing a full post revision?

mnordhoff · September 17, 2018, 5:32pm

Hi. I’m one of the Let’s Encrypt community moderators.

I do read almost every post; the big issue is when the spammy edit comes after I’ve already read it.

If a thread has an “already read” grey link and it’s two pages down, I usually won’t click on it again.

For me, what would help most is:

A review queue-like view of posts recently edited by TL 0 or 1 users.
If edits to old posts always turned the link black and/or bumped the topic to the top of the page.
A magic AI plagiarism detector that searches GitHub, Reddit and a thesaurus.

Edit: (We’ve gotten some spam with obvious machine thesaurus replacements in the text, but most of it doesn’t try that.)
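
Short of a magic AI, a crude copy-paste check is possible with word shingles and Jaccard similarity. The sketch below assumes you already have the text of candidate source posts to compare against; the function names and the shingle size are illustrative choices, and this would not catch the thesaurus-replacement variety.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in a text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


source = "Have you tried running certbot with the staging flag first?"
copied = "Have you tried running certbot with the staging flag first?"
unrelated = "My nginx config fails to reload after the certificate renews."
print(jaccard(source, copied), jaccard(source, unrelated))
```

A new post whose similarity to an older post exceeds some threshold could be queued for a human look rather than acted on automatically.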

sam · September 17, 2018, 9:42pm

We can easily do a data explorer query here for TL0 / TL1 edits, I can help you with that (list last 500 posts edited by TL0/1), then a very simple internal process could be running the query say weekly.

Would that help? I can run it now to see how much stuff has fallen between cracks.

Buried edits always bumping would be way too aggressive imo.
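
For a sense of what such a query might look like, here is a sketch run against an in-memory SQLite database from Python. The two tables are a simplified, hypothetical stand-in for the forum's data, not the real Discourse schema, so the table and column names would need adapting before anything like this could run in Data Explorer.

```python
import sqlite3

# Hypothetical, simplified schema: one row per post revision, plus the
# trust level of the user who made it.
QUERY = """
SELECT r.post_id, u.username, u.trust_level, r.created_at
FROM post_revisions AS r
JOIN users AS u ON u.id = r.user_id
WHERE u.trust_level <= 1
ORDER BY r.created_at DESC
LIMIT ?
"""


def recent_low_trust_edits(conn, limit=500):
    """Return the most recent revisions made by TL0/TL1 users, newest first."""
    return conn.execute(QUERY, (limit,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, trust_level INTEGER);
CREATE TABLE post_revisions (
    id INTEGER PRIMARY KEY, post_id INTEGER, user_id INTEGER, created_at TEXT
);
INSERT INTO users VALUES (1, 'newcomer', 0), (2, 'regular', 3);
INSERT INTO post_revisions VALUES
    (1, 10, 1, '2018-09-17 10:00:00'),
    (2, 11, 2, '2018-09-17 11:00:00'),
    (3, 12, 1, '2018-09-17 12:00:00');
""")
rows = recent_low_trust_edits(conn)
for row in rows:
    print(row)
```

Only the two revisions by the TL0 user come back, newest first; the TL3 user's edit is filtered out.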

mnordhoff · September 17, 2018, 11:09pm

I didn’t know about Data Explorer! It looks awesome.

I’m just a moderator, so I can’t speak for what the admins would want to do, and I don’t think I’d have access to that.

I’d guess/hope that very little has slipped past the community, but it would be interesting to know for sure.
