freeCodeCamp.org Discourse is Collapsing from Spammer Scripts

We started seeing heightened loads on the freeCodeCamp.org Discourse on June 21. Our average CPU load started to grow to the point that the entire forum became unresponsive.

Our initial investigation

  1. We updated Discourse from the Stable version to the latest Tests-Passed (Beta) version, which is recommended to get all the latest performance fixes. This did not seem to make much of a difference.

  2. We removed some of the old plugins to help diagnose the issue, but they do not seem to be the cause of the slowdown.

  3. To rule out any misconfigurations, we tested this with the default theme and checked the /logs and did find some exceptions:

    Job exception: could not obtain connection from the pool within 5.000 seconds (awaited 5.006 seconds); all pooled connections were in use

This seems to be because the instance may already be stressed by a lot of long-running jobs.

The Discourse container (upstream) is running on Digital Ocean and seems to be running hot on all 6 vCPUs. It is currently running version 2.5.0.beta7.

Further investigation

  1. This morning our moderators let us know that there was a significant increase in the number of spam accounts, which leads us to believe that this could be a targeted attack.

  2. To try and mitigate the issue, we have put the forum into read-only mode. Despite this, the resource utilization rate remains quite high at 60+%.

  3. From what we can tell: ever since this started, we have been seeing a lower requests-per-second and higher current request rate on our proxy:

  1. The volume of traffic to freecodecamp.org/forum has not significantly changed according to Google Analytics.
  2. None of the other applications on the same proxy seem to be affected. Our /news and /learn platforms are operating normally.
  3. We tried adding some additional rate limiting on the proxy, and it did not make much difference, so we removed it (to allow normal users to continue to reach our site).
  4. We have also moved our old forum.freecodecamp.com redirects away from the .org forum since we noticed that old subdomain (which we haven’t used in 2.5 years) had a big spike in traffic starting a few days ago.

At the moment with ~400 real-time users, the stats look like so:

Observations About The Spammer’s Behavior

The spammers seem to be running some sort of script that creates new accounts on a Discourse instance, then waits several days. Then those accounts suddenly become active. They start posting new threads with links to a website. (Perhaps to build backlinks?)

Some of these accounts were created all the way back in March.

Here’s what one of their spam posts looks like, though there are a lot of variants, linking to a lot of websites:

Any advice would be welcome here.

16 Likes

This deserves much more attention

Thank you so much for sharing!

One of my favorite Discourse instances, freeCodeCamp, has slowed to a halt the last two days.

5 Likes

I don’t think the heavy load is mainly caused by spammers.
Here is why:

Normal Discourse usage (without the API) only works with a modern browser because it depends heavily on JavaScript. Google Analytics also uses JavaScript to collect various user information (and doesn’t need modern JavaScript support to run the analytics code). If Google Analytics can’t detect user activity, then Discourse should not be able to serve its content and features.

Here is a capture from when I used an old headless browser library (PhantomJS) to access a Discourse site:

And here is one with a more modern library (Puppeteer):

  1. It would be strange if someone were able to post replies on Discourse (without the API) while Google Analytics can’t detect them.
  2. Usually, spammers use public proxies to do dirty things, and I think your Cloudflare is smart enough to stop “bad visitors” before they really reach your app.

Did you see a high number of queued jobs in the Sidekiq process?

2 Likes

This ignores the fact that more than half of people in tech communities are running some sort of Ad Block extension, which blocks Google Analytics by default.

19 Likes

Including the new macOS.

Was nasty rumour only. See below.

5 Likes

Is akismet enabled? Can you turn on moderation for trust level 1?

2 Likes

Yes, we did see the jobs kicking up and down as a part of my initial investigation. We then proceeded to revoke all of the user keys.

That does not seem to have changed the stability, though.

We are, however, working very closely with the Discourse team to resolve this.

5 Likes

I don’t see why it’s strange that there can be requests without JS-based analytics showing anything.
The spammer can use the same endpoints as the JavaScript does, without JavaScript. No JavaScript-based analytics would trigger.

3 Likes

This is far more likely to be a plug-in interaction or bad proxying config. We see zero success from “spammers” on our hosting.

Given it is a coding site it could also be “coders” trying to do something weird on the forum as well?

But no — spammers aren’t a generalized problem across the thousands of sites we host.

3 Likes

Yes, I am inclined to think it’s either config (a bad plugin or proxying setup) or someone messing with the forum.

The latter is the more probable cause, given all the patterns.

From what I have noticed, the spam accounts have been created over a long period of time, and they are trying to add links (to gain backlinks??) in their bios, along with all sorts of other weirdness.

Also, there could be scraping involved, because we have put the site in read-only mode and have a cache set up on the proxy, along with rate limiting. Resource usage on the upstream container seems to be high regardless.

However, I would not rule out that our config could be bad as well. We use subpaths and Cloudflare on top of our reverse proxy, which is traditionally not the most efficient setup and not the one Discourse recommends.

1 Like

42.5% steal is really high, even if you’re the one that is causing the issues on that hypervisor. This looks like a noisy neighbour to me. If I were you I would contact DO and ask them to move the droplet to another hypervisor.
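For anyone who wants to check steal time without a screenshot of top, here is a minimal Python sketch. It assumes a Linux host, where the aggregate `cpu` line in `/proc/stat` lists the counters user, nice, system, idle, iowait, irq, softirq and steal in that order. The function names are illustrative and not from any tool mentioned in this topic.

```python
import time


def steal_percent(before: str, after: str) -> float:
    """Return the percentage of CPU time stolen between two samples.

    Each argument is the aggregate "cpu" line from /proc/stat.
    """
    def counters(line: str) -> list:
        # Skip the "cpu" label and keep the first eight counters:
        # user nice system idle iowait irq softirq steal.
        # Guest time is already included in user/nice, so it is left out.
        return [int(value) for value in line.split()[1:9]]

    deltas = [b - a for a, b in zip(counters(before), counters(after))]
    total = sum(deltas)
    return 100.0 * deltas[7] / total if total else 0.0


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the aggregate CPU line (the first line of /proc/stat)."""
    with open(path) as handle:
        return handle.readline()


def measure(interval: float = 5.0) -> float:
    """Sample /proc/stat twice, `interval` seconds apart (Linux only)."""
    first = read_cpu_line()
    time.sleep(interval)
    return steal_percent(first, read_cpu_line())
```

Calling `measure()` on the droplet a few times during a slow period gives a number comparable to the `st` column in top. A value that stays high while the forum itself is mostly idle would support the noisy-neighbour theory.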

10 Likes

I’m sure you are probably doing this already, but just to be sure, I’d suggest monitoring open TCP/UDP connections summarized by IP at the OS level. If there is high CPU load, it should show you a massive number of open connections to the web server.
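As a rough sketch of that kind of summary, the Python below counts open TCP connections per remote IP. It assumes the `ss` utility from iproute2 is installed and that `ss -tn` prints the peer address in the fifth column. It covers TCP only, and the function names are made up for illustration.

```python
import subprocess
from collections import Counter


def connections_by_ip(ss_output: str) -> Counter:
    """Count connections per peer IP from the text output of `ss -tn`."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header row and anything too short to hold a peer address.
        if len(fields) < 5 or fields[0] == "State":
            continue
        peer = fields[4]
        # Drop the port; strip the brackets ss puts around IPv6 addresses.
        ip = peer.rsplit(":", 1)[0].strip("[]")
        counts[ip] += 1
    return counts


def current_connections() -> Counter:
    """Run `ss -tn` (TCP sockets, numeric addresses) and summarize it."""
    result = subprocess.run(
        ["ss", "-tn"], capture_output=True, text=True, check=True
    )
    return connections_by_ip(result.stdout)
```

`current_connections().most_common(20)` then lists the busiest remote addresses. Behind Cloudflare or another proxy, most of these will be proxy addresses rather than end users, so the summary is more telling on the outermost machine.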

Any strange pattern in production.log?

1 Like
4 Likes

Hi Quincy @ossia

Let’s take a quick step back for a second and look at this from a professional cybersecurity perspective, sans the speculation and “grasping at straws” approach.

The key concept in all cybersecurity tasks is the concept of “situational awareness” so in this case, it’s called “cyber situational awareness” (CSA).

In order to know “what is happening”, in a definitive way, you need to develop the best situational knowledge you can without speculation or guessing. Just the facts.

How do you do this?

Well, very briefly:

We do this by fusing information from all our sensors, and for web-based applications this normally comes from the log files and the session data. I don’t think (off the top of my head) that discourse maintains session information in the PG database (the last time I checked there was no session table like in some LAMP web apps), but that’s not a showstopper at all.

You have most everything you need in the nginx log files for both your reverse proxy outside the container (I recall reading in this topic that you were using nginx as a proxy) and the same logging information is also in the container. In both setups, the log file is here, in the standard OOTB setup:

Here is an example in one of our setups (outside the container) for the reverse proxy:

# cd /var/log/nginx
# ls -l 
total 779964
-rw-r----- 1 www-data adm         0 Jun 17 06:25 access.log
-rw-r----- 1 www-data adm 660766201 Jun 25 18:26 access.log.1
-rw-r----- 1 www-data adm 107367317 Jun 17 03:18 access.log.2.gz
-rw-r----- 1 www-data adm  21890638 May 21 03:08 access.log.3.gz
-rw-r----- 1 www-data adm   7414232 May  5 07:26 access.log.4.gz
-rw-r----- 1 www-data adm     63289 Apr 18 09:12 access.log.5.gz
-rw-r----- 1 www-data adm         0 Jun 17 06:25 error.log
-rw-r----- 1 www-data adm    904864 Jun 25 18:19 error.log.1
-rw-r----- 1 www-data adm     96255 Jun 17 03:17 error.log.2.gz
-rw-r----- 1 www-data adm     79065 May 21 02:58 error.log.3.gz
-rw-r----- 1 www-data adm     70799 May  5 06:54 error.log.4.gz
-rw-r----- 1 www-data adm      1977 Apr 18 05:49 error.log.5.gz

Here is the same basic information logging inside the discourse container:

# cd /var/discourse/
# ./launcher enter socket
# cd /var/log/nginx
# ls -l
total 215440
-rw-r--r-- 1 www-data www-data  87002396 Jun 25 18:28 access.log
-rw-r--r-- 1 www-data www-data 101014650 Jun 25 08:02 access.log.1
-rw-r--r-- 1 www-data www-data   8217731 Jun 24 08:02 access.log.2.gz
-rw-r--r-- 1 www-data www-data   6972317 Jun 23 07:53 access.log.3.gz
-rw-r--r-- 1 www-data www-data   3136381 Jun 22 07:50 access.log.4.gz
-rw-r--r-- 1 www-data www-data   2661418 Jun 21 07:45 access.log.5.gz
-rw-r--r-- 1 www-data www-data   5098097 Jun 20 07:38 access.log.6.gz
-rw-r--r-- 1 www-data www-data   6461672 Jun 19 07:40 access.log.7.gz
-rw-r--r-- 1 www-data www-data         0 Jun 25 08:02 error.log
-rw-r--r-- 1 www-data www-data         0 Jun 24 08:02 error.log.1
-rw-r--r-- 1 www-data www-data        20 Jun 23 07:53 error.log.2.gz
-rw-r--r-- 1 www-data www-data       254 Jun 23 02:36 error.log.3.gz
-rw-r--r-- 1 www-data www-data        20 Jun 21 07:45 error.log.4.gz
-rw-r--r-- 1 www-data www-data        20 Jun 20 07:38 error.log.5.gz
-rw-r--r-- 1 www-data www-data        20 Jun 19 07:40 error.log.6.gz
-rw-r--r-- 1 www-data www-data       274 Jun 18 15:40 error.log.7.gz

Note: That “in container” info is also available from outside the container on the shared volume.

Hence (and to keep this reply short), @ossia, just about everything you need to gain situational knowledge of what is happening is in these robust log files. No speculation is necessary. The data is all there.

There is even more great data available in the Rails log. For example, on one of our setups, here is the Rails production log:

tail -f /var/discourse/shared/socket/log/rails/production.log

The rails log has a lot of great user logging information as well, for example:

Started GET "/embed/comments?topic_id=378686" for 73.63.114.60 at 2020-06-25 18:36:15 +0000
Started GET "/embed/comments?topic_id=378686" for 195.184.106.202 at 2020-06-25 18:36:16 +0000
Started GET "/embed/comments?topic_id=378686" for 17.150.212.174 at 2020-06-25 18:36:16 +0000
Started GET "/embed/comments?topic_id=378686" for 76.235.99.73 at 2020-06-25 18:36:18 +0000
Started GET "/embed/comments?topic_id=378686" for 124.253.211.42 at 2020-06-25 18:36:19 +0000
Started GET "/embed/comments?topic_id=378686" for 103.96.30.11 at 2020-06-25 18:36:21 +0000
Started GET "/embed/comments?topic_id=378686" for 72.191.206.59 at 2020-06-25 18:36:22 +0000
Started GET "/embed/comments?topic_id=378686" for 68.252.68.76 at 2020-06-25 18:36:23 +0000
Started GET "/embed/comments?topic_id=378686" for 69.17.252.83 at 2020-06-25 18:36:23 +0000
Started GET "/embed/comments?topic_id=378686" for 98.109.33.230 at 2020-06-25 18:36:24 +0000

Note: Here (above, as an example) we see the IP addresses of clients pulling the discourse embedded code from another server.
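As one way to turn those lines into numbers, here is a small Python sketch that tallies requests per client IP and per path. It assumes every request line has the `Started METHOD "path" for IP at ...` shape shown in the sample above. The function name and the regular expression are illustrative and not part of Discourse.

```python
import re
from collections import Counter

# Matches lines such as:
# Started GET "/embed/comments?topic_id=378686" for 73.63.114.60 at 2020-06-25 18:36:15 +0000
STARTED = re.compile(
    r'^Started (?P<method>[A-Z]+) "(?P<path>[^"]*)" for (?P<ip>\S+) at '
)


def tally_requests(lines):
    """Return (hits per IP, hits per path) for the "Started" lines."""
    by_ip, by_path = Counter(), Counter()
    for line in lines:
        match = STARTED.match(line)
        if match is None:
            continue  # not a request-start line
        by_ip[match.group("ip")] += 1
        by_path[match.group("path")] += 1
    return by_ip, by_path
```

Passing an open file handle for `production.log` as `lines` and then calling `most_common(20)` on each counter shows which addresses and which endpoints dominate the traffic.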

The task at hand....

Back to the task at hand, the “trick” is to move past speculation and guessing, and to do the fun (1) filtering / data cleansing, (2) data fusion and (3) analysis of your sensor data ( logfiles ) to create (4) the situational awareness (SA) of what is happening at your site.

For older LAMP apps, I actually have custom code I wrote years ago which writes all this information to a DB table and does the analysis in real-time and counts the “hits” by IP address (as one example) where I can quickly see what and who and from where is hitting the site, because it does take some code to do this kind of data cleansing, filtering and fusion. (Useful during DDOS attacks, and rogue bot activity, for example).
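The custom code mentioned above is not shown in this topic, so the following is only a minimal Python sketch of the same idea, using in-memory counters instead of a DB table. It assumes the access log is in nginx's default "combined" format, where the client IP comes first and the timestamp sits in square brackets. A site with a custom `log_format` would need a different pattern. All names here are hypothetical.

```python
import gzip
import re
from collections import Counter

# Assumed "combined" layout:
# 203.0.113.7 - - [25/Jun/2020:18:36:15 +0000] "GET /latest HTTP/1.1" 200 512 "-" "agent"
COMBINED = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) '
)


def open_log(path):
    """Open a plain or gzip-rotated log file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, errors="replace")


def summarize(lines):
    """Return (hits per IP, hits per minute) for the lines that parse."""
    by_ip, by_minute = Counter(), Counter()
    for line in lines:
        match = COMBINED.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        by_ip[match.group("ip")] += 1
        # "25/Jun/2020:18:36:15 +0000" -> "25/Jun/2020:18:36"
        by_minute[match.group("time")[:17]] += 1
    return by_ip, by_minute
```

Running `summarize(open_log("/var/log/nginx/access.log.1"))` and looking at `by_ip.most_common(20)` next to the busiest minutes in `by_minute` gives a first view of who is hitting the site and when.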

That’s no problem for you @ossia because you are freeCodeCamp.org, so you have the knowledge both to find great log file analysis tools (there are many out there in the cyberverse) and to create your own custom code to do the analysis quickly and easily, based on the scenario you wish to understand (your topic and issue).

I wrote my custom code for an old legacy LAMP app in a few hours many years ago, and I’m no coding genius by any stretch of the imagination, even though I am sometimes referred to as a “legend” by many in the cybersecurity field, LOL :slight_smile:

To summarize....

Well, to summarize…

You have all the data you need to create deep situational knowledge of “what is going on” on your site, and you can create that SA by cleaning, filtering, fusing and doing some basic analysis of your logfile data. There are tools out there which can help, but I always find it easier to bang out some custom code based on the objective of the analysis (dependent analysis), YMMV. You can easily do this because you are freeCodeCamp.org and have a lot of tech skills.

I recommend you move away from trying to gain SA from Google Analytics and other JS-based third-party apps. Nothing is better than your own web log files (and DB session data if you have it), and you don’t have to worry about “what may or may not be blocked” etc. Your web server log files contain the data to gain the CSA you need (and can also be customized when needed).

In some of my CSA code, I actually intercept the session info and log information from the HTTP requests not logged by nginx, apache2 and other web servers (for additional info); but I have not written this kind of code for discourse (yet) as I’m not as “easy as pie” of a discourse plugin developer (like the meta discourse team gurus here) as I am with LAMP apps, having only started with discourse a few months ago and have not yet written any custom CSA code for discourse (and trying to code less this year, to be honest).

CSA is based on the fusion of sensor data and from CSA comes the knowledge to understand what actions you need to take to remediate any cybersecurity issue.

All the best in your quest and hope this helps you to have more rest :slight_smile:

Cheers!


Original (Historical) CSA Reference:

https://www.researchgate.net/publication/220420389_Intrusion_Detection_Systems_and_Multisensor_Data_Fusion

(Reference only for people interested in the origins and core tech of CSA)

14 Likes

Thanks for your hard work, team!

& thank you so so so much
for simplifying that
double :nauseated_face: toolbar

4 Likes

I am just going to close the loop here.

Discourse is now hosting https://forum.freecodecamp.org/. The site is ultra snappy, and spammer scripts are no longer causing even a blip. We remain somewhat unclear about what the issue was on Digital Ocean: there may have been a noisy neighbor, the machine may have been underpowered, or there may have been machine gremlins. We are not sure. But the issue in the OP is now 100% resolved and the community is very happy.

16 Likes