Real-time updating of topics freezes under high activity

@sam

With these settings the UX was better. Yes, there were several “chokes” and a bunch of 429s were recorded in my Chrome inspector. CPU load was low. But then again, it was a rather calm home game (many active members were on-site, not chatting).

I can’t name the exact dials to turn but, from my rather subjective experience:

  • The throttling code is still overprotective of server load. Perhaps a slightly higher server stress level could be allowed.
  • When the client backs off, the delay is too long from a UX perspective. The game goes on and a lot can happen in a minute. The chat goes out of sync, with people referring to different events in the game. (This adds to the existing problem of the varying delays between real time, cable TV, IPTV, a 20-second Chromecast buffer, etc.)
  • The user only sees that the chat has stalled and receives no indication that the site is still online and active. They are then more likely to refresh the page, or do other things that add to the already high load.

Just to rule things out, I upgraded the server to 8 vCores and 32GB RAM. I set db buffers to 16GB and Unicorns to 16, and reverted the other tweaks to defaults.
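For reference, a rough sketch of the app.yml lines I touched (assuming the standard standalone template, where db_shared_buffers sits under params and UNICORN_WORKERS under env):

  params:
    # PostgreSQL shared buffers for the bundled database
    db_shared_buffers: "16GB"
  env:
    # number of Unicorn web workers
    UNICORN_WORKERS: 16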

Unfortunately the upgrade did not do much. Rapid discussions are constantly freezing, even with moderate activity.

The performance is miserable nowadays. I guess I need to start looking at Prometheus etc. I am 95% certain that the performance of the software has seriously regressed since v2.3.

Brother @Iceman’s comment was mostly neglected in September. He reports that the chokes happen no matter what hardware he throws at the problem.

I suspect you may be hitting a Redis bottleneck, but as I have said many times, we can only be sure if you collect those statistics. Without them we may as well use astrology.

If my suspicion is right, it would also explain why throwing more slow cores and RAM at the problem makes no difference: since Redis is single-threaded, you can only scale it up by getting higher-performance cores.

We will release a new image with the final release of 2.6 by the end of the month, and it comes with Redis 6 and new app.yml variables to put it to good use. Let me know if you wanna test that earlier and I can give you instructions for that.

5 Likes

Just noticed this on a closed topic. @codinghorror - that is incorrect. What the end user actually gets in a high-load situation:

  1. A notification that they have been logged out
  2. They are taken to the site index page
  3. The index page shows the high-load banner notification

The user is not really logged out, though. Usually, when they tap back into the active topic, the site operates as usual.

Yet again, we have no customers reporting this behavior (out of thousands, and many much busier than your site), so further discussion at this point is basically useless – we have no visibility into whatever odd configuration situation or hardware performance strangeness you may have over there.

In the future hopefully that will change and we will have better visibility into the actual problem.

I was only reporting what the actual UI/UX is when the high-load situation happens. Nothing else.

The behavior should be that they are kept on the topic page and shown a logged-out view, not brought to the home page.

You are most likely right. It is Redis. The new base image improves things, but now we are exceeding the server’s capabilities.

Possibly, but that is not how it works in reality. Just reproduced it a minute ago.

1 Like

Well, at least that has a known solution: :moneybag:

3 Likes

Solution: Make leaner and meaner code :wink:

So if Redis is the bottleneck, how would you scale horizontally?

It still puzzles me what has changed since last season. I can’t see that much organic growth or increase in game-chat popularity. Still, our capacity to serve has dropped dramatically, and the site is choking even during the calmest games.

Until you can collect metrics on your historical instance of Discourse and compare them to the metrics you collect on your current install, while keeping the exact same hardware, this will remain a mystery.

The whole difference could be that your VPS provider shifted you from one physical machine to another, that you acquired a noisy neighbour, or that your host is now running an average of 17 vs 13 co-hosted services per machine.

1 Like

Please do not speculate about pushing the issue onto the VPS provider. UpCloud is one of the best on the market, and they have checked their end for anything out of the ordinary. They advertise on our site, and it is not very good PR to have the site stuttering :smiley:

But there is no historical data, and TBH I was not paying that much attention, as everything just worked until the first exhibition games took place in August. Of course, human behavioral patterns have changed thanks to COVID, and who knows what else. I can’t see it in our site or server metrics, though. :man_shrugging:

But this is excellent testing material. Just provided @riking with some screenshots of what happens when the server overload kicks in. I guess you guys don’t see it that often.

1 Like

Note that nobody is disagreeing with you – we’re just pointing out that a doctor can only do so much to diagnose a patient when the doctor is limited to seeing the patient through a video camera on the internet… :movie_camera:

3 Likes

Just wanted to say this is exactly what I experienced when I first set up my site (so it’s not unique to your site).

Here’s a thread I made about it at the time:

This is what caused me to jump up through the different CPU/memory options outlined here.

Unfortunately, I have not had a chance to properly swap from DigitalOcean to Hetzner as I described (I started a new job), but I will as soon as I get a chance this month.

The end-user experience of being kicked out of the thread versus remaining in the thread (with the logged-out message) did seem to depend on load (more users were sent to the site index after a goal was scored).

I don’t have enough technical knowledge to be helpful, but I felt it might help to know that a sports site with similar peaks of chat-like behaviour runs into a similar issue. Mine (a smaller and younger site) was resolved by further upgrading the server, though.

1 Like

If you’re interested in having data to make decisions about how to diagnose things going forward, you can install the Prometheus exporter plugin for Discourse.
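For anyone following along, the plugin is normally added via the plugin hook in app.yml and picked up on the next rebuild; a minimal sketch, assuming the stock container layout:

  hooks:
    after_code:
      - exec:
          cd: $home/plugins
          cmd:
            # official Prometheus exporter plugin for Discourse
            - git clone https://github.com/discourse/discourse-prometheus.git

After a rebuild it exposes a metrics endpoint that an external Prometheus server can scrape.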

Just a brief update:

  • Installed a new two-container environment on two VPS servers (web_only and data); see the sketch after this list.
  • Surprisingly (to me), the web_only server is the one being exhausted, while the data server is relatively lightly loaded. Both were running a 4x vCore / 8GB RAM UpCloud.com plan.
  • Upgraded web_only to a 6x vCore / 16GB RAM UpCloud.com plan and increased Unicorns to 18.
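Roughly, the split looks like this in web_only.yml (a sketch based on the sample templates that ship with discourse_docker; 10.0.0.2 is just a placeholder for the data server’s private address):

  env:
    # point the web container at the separate data container
    DISCOURSE_DB_HOST: 10.0.0.2      # placeholder for the data server
    DISCOURSE_REDIS_HOST: 10.0.0.2   # placeholder for the data server
    UNICORN_WORKERS: 18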

Still, we are hitting various 429 limiters. The system-under-high-load mode did not kick in, though.

The hockey season has been ruined by COVID, and they are now playing a few scattered games without an audience. Since we do have hosting credits with UpCloud.com, we are pushing to improve the experience with what we have. We are now running 6x vCore / 16GB for web_only and 4x vCore / 8GB for data, with Unicorns at 18.

We once again disabled the rate limiter…

DISCOURSE_MAX_REQS_PER_IP_MODE: none

…which helps, but we still get 429s from the message-bus POLL requests, which produce the long delay/freeze for the end user. We are going to continue tweaking by increasing DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS.
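For context, these knobs all live in the env section of web_only.yml; a sketch of the current experiment (the queue threshold below is only an example value, not something we have settled on):

  env:
    # per-IP rate limiting disabled again
    DISCOURSE_MAX_REQS_PER_IP_MODE: none
    # next experiment: raise the message-bus queue cutoff above its default
    # (example value only)
    DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS: 0.2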

But before we do that, a question to @sam / staff:

Is there an environment variable to increase the threshold for the extreme-load read-only mode limiter, or can it be disabled completely?

This should not be needed; we would love to host you so we can get to the bottom of why this keeps tripping you even though you have such low traffic.

2 Likes

Perhaps so, but we would like to be slightly less protective of the server, as the naturally occurring activity spikes are very short and generally stabilize within a minute or so. Adjusting the thresholds just a little higher might improve the UX while we wait for the move.

The games have been scarce (thanks to COVID), so we have had very few opportunities to measure and tinker with this.

What we found out is that even with our improved hardware resources (6+4 vCores and 16+8GB RAM), a modestly active crowd is able to produce 429 client freezes. We saw this with the U20 WC games, which attracted roughly 50% of our regular game-chat audience.

Through measuring and trial and error, we have settled on the following tweaks:

  DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS: 0.4
  DISCOURSE_MAX_REQS_PER_IP_PER_MINUTE: 400
  DISCOURSE_MAX_REQS_PER_IP_PER_10_SECONDS: 100

This seems to eliminate 80% of the 429s, enabling a relatively smooth experience for the majority of users.

The next step would have been buying a different kind of hardware, either dedicated boxes for single-threaded speed or a VPS provider that offers plans with a gazillion vCores. For us, however, the next step is to work with the Discourse hosting team, as @sam hinted earlier.

Hopefully these tweaks will be useful for @iceman, @alec or anyone else. Be sure to keep an eye on CPU usage and queuing. Another thing I learned from this exercise is that two containers are way better than one: tweaks can be applied with near-zero downtime, and hardware resources can be used more granularly.

I am still interested in any new tweaks or findings that might help improve the performance/UX of fast-paced discussions driven by real-world events.

1 Like