Real-time updating of topics freezes under high activity

What do you have in the db_shared_buffers parameter? We had a lot of “unstable” behavior in the beginning (some topics “chewing” a lot, especially when heavily participated in) with just the recommended 25% of total RAM. When we increased it to 16GB (out of 32GB), all that instability just went away… and more recently it got even better with the latest changes.
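For reference, a minimal sketch of where that parameter lives, assuming a standard discourse_docker install; the 16GB value is the one described above for a 32GB host, not a general recommendation:

    # containers/app.yml (excerpt)
    params:
      # PostgreSQL shared buffers; the stock template suggests roughly 25% of RAM,
      # raised here to 16GB on a 32GB host as described above
      db_shared_buffers: "16GB"

A ./launcher rebuild app is needed for the change to take effect.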

1 Like

Okay, so this phenomenon is difficult to monitor in a production environment (game chats), since every game is different – a different number of critical events, a different opponent, a different emotional charge, and so forth.

The issue from our perspective is that our maximum capacity to serve has decreased since 2.3. That is the key here. Every server has its limits, but now we are getting less out of ours than we did in March, running 2.3. Based on rough back-end monitoring, the server is not able to reach 100% load or capacity.

What the end user sees is that the chat flow simply stops, without any UI indication of what is going on. That causes confusion.

I am fairly certain that the changes in tests-passed have improved the situation, but the performance or maximum output is still significantly lower than with 2.3.

We have a VPS with 6 fast cores and 16GB of RAM. Unicorns are at 12, and RAM buffer-related settings are at their defaults.

I think the best next step here is to set up historical monitoring of your system so that we can figure out where the bottleneck is, because we’ve established that it isn’t CPU time. It’s always possible that you’re maxing out your network connection!

The Prometheus exporter plugin for Discourse is a good starting point, plus more traditional server metrics via node-exporter.
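For the host-level metrics, a minimal sketch of running node-exporter next to the Discourse container, assuming Docker Compose is available on the host (the service layout is illustrative, not something from this thread):

    # docker-compose.yml (illustrative)
    services:
      node-exporter:
        image: prom/node-exporter:latest
        # expose host CPU, memory, disk and network metrics on port 9100
        ports:
          - "9100:9100"
        volumes:
          - /proc:/host/proc:ro
          - /sys:/host/sys:ro
        command:
          - "--path.procfs=/host/proc"
          - "--path.sysfs=/host/sys"
        restart: unless-stopped

Prometheus can then scrape port 9100 alongside whatever the Discourse exporter provides.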

2 Likes

If this is the case and you want to push it harder:

  1. You can relax the rate limits, which will allow users to interact more aggressively with Discourse. Specifically, you could double DISCOURSE_MAX_REQS_PER_IP_PER_MINUTE and DISCOURSE_MAX_REQS_PER_IP_PER_10_SECONDS (see the sketch after this list).

  2. You can try adding more unicorn workers.
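A sketch of what both suggestions could look like in the env section of containers/app.yml, assuming the stock defaults (200 requests per IP per minute, 50 per 10 seconds) as the baseline being doubled; check your own file before copying values:

    # containers/app.yml (excerpt)
    env:
      # double the per-IP rate limits (assumed defaults: 200 and 50)
      DISCOURSE_MAX_REQS_PER_IP_PER_MINUTE: 400
      DISCOURSE_MAX_REQS_PER_IP_PER_10_SECONDS: 100
      # add unicorn workers; a common rule of thumb is about two per CPU core
      UNICORN_WORKERS: 12

Both changes require a ./launcher rebuild app to apply.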

This is expected temporarily while you are overloaded, but things should recover automatically once the load drops.

My guess here is that this is all just rate-limit related. The rate limits are new and are there to protect the server; it appears your server is being protected by design.

2 Likes

We tried a game with…

DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS: 0.3
DISCOURSE_MAX_REQS_PER_IP_MODE: none

…and when the emotions started to warm up in the 3rd period, things got worse. We reached the limits of our server: users were constantly pushed into logged-out mode and the game chat kept freezing as well.

It was a great success story for 4 years, but now we are in a very tight spot. Jumping to the next level of VPS capacity would take us to the ~160€/month price category, which is a challenge for a hobby site. We are not talking about huge user volumes either - 116 people posted 800+ messages during the game.

The “don’t do chats” ideology is not suitable either. If we did not have those topics, emotional reaction posts would scatter all over the more “serious” topics. They are an important tool for channeling the emotional charge of a live situation into a single topic, keeping the more analytical topics clean.

Mine is a football forum and I have experienced similar challenges.

Basically, what I found was that it was a scaling issue.

The issues for me kicked in at different levels.

Digital Ocean:
  • 1 CPU, 1 GB = 30-40 users in a chat-like situation
  • 2 CPUs, 2 GB = 70-80 users in a chat-like situation
  • 4 CPUs, 8 GB = fine for 120 users and 1000 posts in 2 hours; didn’t reach the limit

I am trying the different step-up levels with Hetzner (a mirror site) as it is cheaper, and it didn’t go as smoothly as hoped.

My experience so far:
  • 3 CPUs (CPX 21, AMD chip), 4 GB = struggling with 20 users
  • 2 CPUs (Intel), 8 GB = no issue with 20 users

About to test with 80 to 100 simultaneous users under match conditions.

When I looked at CPU usage with Digital Ocean, even under stress it seemed fairly low, under 50% at all times at all tiers.

When I looked at CPU usage on Hetzner with the AMD chip, I was seeing median CPU usage of around 60%, but every minute or so there was a short spike up to 300%. This didn’t seem to occur with the Intel chip.

What this means, I don’t know. I suspect CPU monitoring is better with Hetzner (capturing short spikes), but overall CPU usage seems well balanced. On the face of it, DO appears to deal better with spikes, but I should have more information on Hetzner after this weekend.

4 Likes

I should also add that in the Hetzner test the Who’s Online plugin didn’t make any difference.

But the Discourse Quick Messages plugin seemed to be detrimental.

The next game is due tomorrow. I have removed our own hacks and we are trying with these settings.

Also, as a total long shot, I have increased db_shared_buffers from 4GB (25%) to 6GB (37.5%). I also uncommented the db_work_mem 40MB line in app.yml (which is, by the way, a very vaguely documented option, while still being presented to the admin as some sort of opportunity for improvement).
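For context, a sketch of the relevant lines in containers/app.yml under that configuration (the values are the ones described above, on a 16GB host):

    # containers/app.yml (excerpt)
    params:
      # raised from the ~25%-of-RAM guideline (4GB) to 6GB as a long shot
      db_shared_buffers: "6GB"
      # per-query working memory used by PostgreSQL for sorts and hash joins
      db_work_mem: "40MB"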

I no longer expect to find a solution for the problem, only better damage control – a set of parameters that has the least negative UX impact for the end users. In the meantime, I’ll have to look into options for further increasing our hosting resources.

2 Likes

Question to @sam & other developers.

How does the ever-growing size of the database impact this use case, where users hammer a single topic for a couple of hours?

I had a look at historical game chat activity and noticed that we had games with huge statistics back in 2017, when our server had a fraction of the resources we have today. We had games where post counts reached 1600 messages by 165 users, and nobody had any complaints about the performance. Now we can’t serve half of that, with a much more powerful server.

You might try upping it to 80MB, maybe instead of the other change.
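That is, roughly (same params section as in the earlier sketch):

    # containers/app.yml (excerpt)
    params:
      db_work_mem: "80MB"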

1 Like

This is one point we are actively working on all the time.

When Discourse was new, almost all sites had a brand new database so the database could fit in memory easily. Now, a few years later, some sites have over 100GB databases and RAM sizes that are not even a tenth of that.

One upcoming update in the next few weeks is the PostgreSQL 13 upgrade, which will cut the largest object size in half.

Other than that, step 0 in debugging your performance issues is gathering data with the Prometheus exporter plugin for Discourse so we are not flying blind.
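A sketch of installing that plugin via the standard plugin hook in containers/app.yml, followed by a container rebuild; the docker_manager line is usually already present:

    # containers/app.yml (excerpt)
    hooks:
      after_code:
        - exec:
            cd: $home/plugins
            cmd:
              - git clone https://github.com/discourse/docker_manager.git
              - git clone https://github.com/discourse/discourse-prometheus.git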

8 Likes

@sam

With these settings the UX was better. Yes, there were several “chokes” and a bunch of 429s were recorded in my Chrome inspector, while CPU load was low. But then again, it was a rather calm home game (many active members were on-site, not chatting).

I can’t name the dials to turn, but from my rather subjective experience:

  • The protection is still overly cautious about server load. Perhaps a slightly higher server stress level could be allowed.
  • When the client backs off, the delay is too long from a UX perspective. The game goes on and a lot can happen in a minute. The chat goes out of sync, with people referring to different events of the game. (This adds to the problem of varying delays between real-time vs. cable TV vs. IPTV vs. a 20-second Chromecast buffer, etc.)
  • The user only sees that the chat has stalled, but receives no indication that the site is still online and active. They are then more likely to refresh the page or do other things that add to the high load.

Just to rule things out, I upgraded the server to 8 vCores and 32GB of RAM. I set db_shared_buffers to 16GB and unicorns to 16, and reverted the other tweaks to defaults.

Unfortunately the upgrade did not do much. Rapid discussions are constantly freezing, even with moderate activity.

The performance is miserable nowadays. I guess I need to start looking at Prometheus etc. I am 95% certain that the performance of the software has seriously regressed since v2.3.

Brother @Iceman’s comment was mostly neglected in September. He reports that the chokes happen no matter what hardware he is throwing at it?

I suspect you may be hitting a Redis bottleneck, but as I said many times, we can only be sure if you collect those statistics. Without them we may as well use astrology.

If my suspicion is right, it would also explain why throwing more slow cores and RAM at the problem makes no difference: since Redis is single-threaded, you can only scale by getting higher-performance cores.

We will release a new image with the final release of 2.6 by the end of the month, and it comes with Redis 6 and new app.yml variables to put it to good use. Let me know if you want to test that earlier; I can give you instructions for that.

5 Likes

Just noticed this on a closed topic. @codinghorror - that is incorrect. What the end user actually gets in a high load situation:

  1. A notification that he is logged out
  2. He is brought to the site index page
  3. The index page has the banner notification of high load

The user is not really logged out though. Usually when one taps back into the active topic, the site will operate as usual.

Yet again, we have no customers reporting this behavior (out of thousands, and many much busier than your site), so further discussion at this point is basically useless – we have no visibility into whatever odd configuration situation or hardware performance strangeness you may have over there.

In the future hopefully that will change and we will have better visibility into the actual problem.

I was only reporting the actual UI/UX when the high-load situation happens. Nothing else.

The behavior should be that they are kept on the topic page and shown a logged-out view, not brought to the home page.

You are most likely right. It is Redis. The new base image improves things, but now we are exceeding the server’s capabilities.

Possibly, but that is not how it works in reality. Just reproduced it a minute ago.

1 Like

Well, at least that has a known solution: :moneybag:

3 Likes