Real-time updating of topics freezes under high activity

We have returned all tweaked values back to defaults and updated to 2.6.0 beta4. Games coming up on Thu and Fri, so we’ll have good test coverage later this week.

2 Likes

@sam

Unfortunately, the fix does not solve the issue. We had a moderately active game with 600 messages. Several freezes were observed, both in my own testing and by our members. They correlate with game events, i.e. activity spikes.

  • CPU usage was well within limits, peaking around 60% with an average load around 30%
  • It is definitely a client-side issue. When the chat topic freezes, if you go to the index page you’ll see the count of unread posts. Click back into the topic and the posts become visible.

What still puzzles me, and is not covered in this topic, is what has changed since v2.3, which did not have this issue?

The major updates 2.4 and 2.5 happened during our (COVID-extended) off-season, so nobody noticed anything, but the freezing was apparent immediately in the very first pre-season exhibition game.

Any parameter hacks we could try for tomorrow? It’s going to be a hot derby and an away game, so the community will be on fire.

In our case, turning off the Who’s Online plugin and deactivating the rate-limiting file (and I read there were some improvements in the more recent beta) seems to have done the trick for us.

We also have soccer games now and then, with 300 users or a bit more clicking and writing in the same topic at the same time, and it seemed to perform much better during the last game.
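For reference, “deactivating the rate-limiting file” for us meant commenting things out in app.yml and rebuilding, and the plugin was removed the same way. A rough sketch of the relevant app.yml bits (the template name and repo URL are from memory, so double-check against your own file):

templates:
  ## (other templates omitted)
  - "templates/web.template.yml"
  # - "templates/web.ratelimited.template.yml"   # the rate-limiting file we commented out

hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://github.com/discourse/docker_manager.git
          # - git clone https://github.com/discourse/discourse-whos-online.git   # Who's Online, disabled for game days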

Are you on the latest version, with the recent fix?

Please, please update to tests-passed. I refined stuff a lot since the beta.

2 Likes

Yep, latest beta version (as in, updated within the last 48h).

Updated. Report will follow.

1 Like

@sam

Unfortunately, still a no-go. Granted, the game was heated, with 950 messages. I had an eye on Google Analytics: around 250 people were watching and 119 posted. Several freezes were observed, as before. Message bus returned some 429s, with the message “You performed this action too many times, please wait X minutes”.

CPU load peaked at ~70% and there was virtually zero I/O wait (wa). So while the activity was high, we are still unable to deliver what the hardware is capable of.

Could you answer the one question that has puzzled me: what has been implemented after 2.3 that is causing this, and what is it supposed to bring to the table?

1 Like

The implementation is largely the same as it always was, except that we now have global app rate limits, which are configurable. You can raise them if you want; it could cause total collapse, I don’t know.

I don’t understand what you mean by freezes. If stuff gets too busy now it will stop updating, but the difference is that you don’t need a browser refresh to fix the page; it will recover once the server has capacity.

A bit unclear here: are your users observing zero improvement after my changes?

Does your server have lots of free RAM? If so, add unicorn workers.
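Roughly, that is one line in the env section of app.yml, then a rebuild (the number below is just an example; size it to your cores and spare RAM, each worker takes a few hundred MB):

env:
  UNICORN_WORKERS: 8   # example value only, raise it while you still have free RAM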

What do you have in the db_shared_buffers parameter? We had a lot of “unstable” behavior in the beginning (some topics “chewing” a lot, especially when heavily participated in) with just the recommended 25% of total RAM. When we increased it to 16GB (out of 32GB), all that instability just went away… and more recently it got even better with the latest changes.
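In app.yml that is just one line in the params section (our value on a 32GB machine, not a general recommendation):

params:
  db_shared_buffers: "16GB"   # was the recommended 25% of RAM before we raised it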

1 Like

Okay, so this phenomenon is difficult to monitor in a production environment (game chats), since every game is different: a different number of critical events, a different opponent, a different emotional charge, and so forth.

The issue from our perspective is that our maximum capacity to serve has decreased since 2.3. That is the key here. Every server has its limits, but we are now getting less out of ours than we did in March, running 2.3. Based on rough back-end monitoring, the server is not able to reach 100% load or capacity.

What the end user sees is that the chat flow simply stops, without any UI indication of what’s going on. That causes confusion.

I am fairly certain that the changes in tests-passed have improved the situation, but the performance, or maximum output, is still significantly lower than with 2.3.

We have a VPS with 6 fast cores and 16GB RAM. Unicorns are at 12, and the RAM/buffer-related settings are at their defaults.

I think the best next step here is to set up historical monitoring of your system so that we can figure out where the bottleneck is, because we’ve established that it isn’t CPU time. It’s always possible that you’re maxing out your network connection!

Something like the Prometheus exporter plugin for Discourse, plus more traditional server metrics like node-exporter.
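If you go the Prometheus route, the scrape config is only a few lines; the hostnames and ports below are placeholders, so adjust them to wherever your exporters actually listen:

# prometheus.yml (sketch)
scrape_configs:
  - job_name: discourse
    metrics_path: /metrics
    static_configs:
      - targets: ['forum.example.com:9405']   # assumed collector port, check the plugin README
  - job_name: node
    static_configs:
      - targets: ['forum.example.com:9100']   # node-exporter default port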

2 Likes

If this is the case and you want to push it harder:

  1. You can relax the rate limits; this will allow users to interact more aggressively with Discourse. Specifically, you could double DISCOURSE_MAX_REQS_PER_IP_PER_MINUTE and DISCOURSE_MAX_REQS_PER_IP_PER_10_SECONDS (see the app.yml sketch after this list)

  2. You can try adding more unicorn workers
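As an app.yml sketch, with both of the above (the defaults I am doubling here are from memory, so verify them against your build before copying):

env:
  DISCOURSE_MAX_REQS_PER_IP_PER_MINUTE: 400      # assumed default 200, doubled
  DISCOURSE_MAX_REQS_PER_IP_PER_10_SECONDS: 100  # assumed default 50, doubled
  UNICORN_WORKERS: 8                             # example value, only if RAM allows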

The freezing is expected temporarily while you are overloaded, but stuff should automatically recover once load is reduced.

My guess here is that this is all just rate limit related, the rate limits are new, and there to protect the server, it appears your server is being protected by design.

2 Likes

We tried a game with…

DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS: 0.3
DISCOURSE_MAX_REQS_PER_IP_MODE: none

…and when the emotions started to warm up in the 3rd period, things got worse. We reached the limits of our server, users were constantly pushed into logged-out mode, and the game chat was freezing as well.

It has been a great success story for 4 years, but now we are in a very tight spot. Jumping to the next level of VPS capacity would take us to the ~160€/month price category, which is a challenge for a hobby site. We are not talking about huge user volumes either: 116 people posted 800+ messages during the game.

The “don’t do chats” ideology is not suitable either. If we did not have those topics, emotional reaction posts would scatter all over the more “serious” topics. The chats are an important tool for channeling the emotional charge of the live situation into a single topic, keeping the more analytical topics clean.

Mine is a football forum and I have experienced similar challenges.

Basically, what I found was that it was a scaling issue.

The issues for me kicked in at different levels.

Digital Ocean
1 CPU / 1GB = 30-40 users in a chat-like situation
2 CPUs / 2GB = 70-80 users in a chat-like situation
4 CPUs / 8GB = fine for 120 users and 1000 posts in 2 hours. Didn’t reach the limit.

I am trying the different step-up levels with Hetzner (mirroring the site), as it is cheaper, and it didn’t go as smoothly as hoped.

My experience so far is:
3 CPUs (CPX21, AMD chip) / 4GB = struggling with 20 users
2 CPUs (Intel) / 8GB = no issue with 20 users.

About to test with 80 to 100 simultaneous users under match conditions.

When I looked at CPU usage with Digital Ocean, even under stress it seemed fairly low, <50% at all times at all tiers.

When I looked at CPU on Hetzner with the AMD chip, I was seeing median CPU usage of around 60%, but every minute or so there was a short spike of up to 300% CPU usage. This didn’t seem to occur with the Intel chip.

What this means, I don’t know. I suspect CPU monitoring is better with Hetzner (capturing short spikes), but overall CPU usage seems well balanced. DO, on face value, appears to deal better with spikes, but I should have more information on Hetzner after this weekend.

4 Likes

I should also add that with the Hetzner test the Who’s Online plugin didn’t make any difference.

But the Discourse Quick Messages plugin seemed to be detrimental.

The next game is due tomorrow. I have removed our own hacks and we are trying with these.

Also, as a total long shot, I have increased db_shared_buffers from 4GB (25%) to 6GB (37.5%). I also uncommented the db_work_mem 40MB line in app.yml (this is, by the way, a very vaguely documented option, while still being presented to the admin as some sort of opportunity for improvement).
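Concretely, the params section of our app.yml now looks roughly like this (our values, not a recommendation):

params:
  db_shared_buffers: "6GB"   # was "4GB", i.e. 25% of our 16GB
  db_work_mem: "40MB"        # previously commented out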

I no longer expect to find a solution for the problem, only better damage control: a set of parameters that has the least negative UX impact on the end users. In the meantime, I’ll have to figure out the possibilities for further increasing our hosting resources.

2 Likes

Question to @sam & other developers.

How does the ever-growing size of the database impact this use case, where users hammer a single topic for a couple of hours?

I had a look at historical game chat activity and noticed that we had games with huge statistics back in 2017, when our server had a fraction of the resources we have today. We had games where the post count reached 1,600 messages by 165 users and nobody had any complaints about the performance. Now we can’t serve half of that, with a much more powerful server.

You might try upping it to 80MB, maybe instead of the other change.

1 Like

This is one point we are actively working on all the time.

When Discourse was new, almost all sites had a brand new database so the database could fit in memory easily. Now, a few years later, some sites have over 100GB databases and RAM sizes that are not even a tenth of that.

One upcoming update in the next few weeks is the PostgreSQL 13 upgrade, which will cut the largest object size in half.

Other than that, step 0 in debugging your performance issues is gathering data with the Prometheus exporter plugin for Discourse, so we are not flying blind.
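Installing it follows the usual plugin pattern in app.yml, then a rebuild of the container (the clone line below assumes the plugin’s standard repository):

hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://github.com/discourse/docker_manager.git
          - git clone https://github.com/discourse/discourse-prometheus.git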

8 Likes