Real-time updating of topics freezes under high activity

Please do not speculate that the issue lies with the VPS provider. UpCloud is one of the best on the market, and they have checked their end for anything out of the ordinary. They advertise on our site, and it is not very good PR to have the site stuttering :smiley:

But there is no historical data, and TBH I was not paying that much attention as everything just worked, until the first exhibition games took place in August. Of course the behavioral patterns of humans have changed thanks to COVID, and who knows what else. I can’t see it in the metrics of our site or server, though. :man_shrugging:

But this is excellent testing material. I just provided @riking some screenshots of what happens when the server overload kicks in. I guess you guys don’t see it that often.

1 Like

Note that nobody is disagreeing with you – we’re just pointing out that a doctor can only do so much to diagnose a patient when the doctor is limited to seeing the patient through a video camera on the internet… :movie_camera:

3 Likes

Just wanted to say this was exactly what I experienced when I first set up my site (so it’s not unique to your site).

Here’s a thread I made about it at the time:

This is what caused me to step up through the different CPU/memory options outlined here

Unfortunately, I have not had a chance to properly swap from DigitalOcean to Hetzner as I described (I started a new job), but I will do so as soon as I get a chance this month.

The end-user experience of being kicked out of the thread, or remaining in the thread (with the logged-out message), did seem to correlate with load (more users were sent to the site index after a goal was scored).

I don’t have enough technical knowledge to be helpful, but felt it might help to know that a sports site with similar peaks of chat-like behaviour runs into a similar issue. Mine (a smaller and younger site) was resolved by further upgrading the server.

1 Like

If you’re interested in having data to make decisions about how to diagnose things going forward, you can install the Prometheus exporter plugin for Discourse.
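
In case it is useful to others, here is a minimal sketch of the usual plugin install in the container YAML; the discourse-prometheus repository is the official exporter plugin, but treat the exact file layout as an assumption about a standard discourse_docker setup:

  ## app.yml (or web_only.yml on a two-container setup)
  hooks:
    after_code:
      - exec:
          cd: $home/plugins
          cmd:
            - git clone https://github.com/discourse/discourse-prometheus.git
  ## then rebuild to bake the plugin in: ./launcher rebuild app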

Just a brief update:

  • Installed a new two-container environment on two VPS servers (web_only, data).
  • Surprisingly (for me), the web_only server is the one being exhausted, while the data server is relatively lightly loaded. Both were running a 4x vCore / 8 GB RAM UpCloud.com plan.
  • Upgraded the web_only server to a 6x vCore / 16 GB RAM UpCloud.com plan and increased Unicorn workers to 18 (sketch below).
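
As a sketch, that worker bump is a one-line change in the web container’s config, assuming a standard discourse_docker two-container layout (file names vary by setup):

  ## containers/web_only.yml
  env:
    UNICORN_WORKERS: 18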

Still, we are hitting various 429 limiters. The “system under high load” mode did not kick in, though.

The hockey season has been ruined by COVID, and they are now playing a few random games without an audience. Since we do have hosting credits with UpCloud.com, we are pushing to improve the experience using what we have. We are now running 6x vCore / 16 GB for web_only and 4x vCore / 8 GB for data, with Unicorn workers at 18.

We once again disabled the rate limiter…

  DISCOURSE_MAX_REQS_PER_IP_MODE: none

…which helps, but we still get 429s from message-bus POLL requests, which produce the long delay/freeze for the end user. We are going to continue tweaking by increasing DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS.
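
For reference, that setting is a plain entry in the env section of the web container YAML; a minimal sketch, assuming the two-container layout described above:

  env:
    ## "none" disables per-IP request rate limiting entirely
    DISCOURSE_MAX_REQS_PER_IP_MODE: none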

But before we do that, a question to @sam / staff:

Is there an environment variable to increase the threshold for the extreme load / read-only mode limiter, or can it be disabled completely?

This should not be needed; we would love to host you so we can get to the bottom of why this keeps tripping you even though you have such low traffic.

2 Likes

Perhaps so, but we would like to be slightly less protective of the server, as the naturally occurring activity spikes are very short and generally stabilize within a minute or so. Adjusting the thresholds just a little bit higher might improve the UX while waiting for the move.

The games have been scarce (thanks to COVID), so we have had very few opportunities to measure and tinker with this.

What we found out is that even with our improved hardware resources (6+4 vCores and 16+8 GB RAM), a modestly active crowd is able to produce 429 client freezes. We saw this with the U20 WC games, which attracted about 50% of our regular game audience for the chats.

Through measurement and trial and error, we have settled on the following tweaks:

  DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS: 0.4
  DISCOURSE_MAX_REQS_PER_IP_PER_MINUTE: 400
  DISCOURSE_MAX_REQS_PER_IP_PER_10_SECONDS: 100

This seems to eliminate 80% of the 429s, enabling a relatively smooth experience for the majority of users.
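
For anyone replicating this, a sketch of how those tweaks sit in the web container YAML and how we apply them without touching the data container (two-container layout assumed):

  ## containers/web_only.yml
  env:
    ## seconds a message-bus poll may wait in the queue before getting a 429
    DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS: 0.4
    DISCOURSE_MAX_REQS_PER_IP_PER_MINUTE: 400
    DISCOURSE_MAX_REQS_PER_IP_PER_10_SECONDS: 100
  ## apply with: ./launcher rebuild web_only  (the data container keeps serving)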

The next step would have been buying a different kind of hardware, either using dedicated boxes for single-threaded speed or switching to a VPS provider that offers plans with a gazillion vCores. For us, however, the next step is to work with the Discourse hosting team, as @sam hinted earlier.

Hopefully these tweaks might be useful for @iceman, @alec or anyone else. Be sure to keep an eye on CPU usage and queuing. Also, what I learned from this exercise is that two containers are way better than one: tweaks can be applied with near-zero downtime, and hardware resources can be allocated more granularly.

I am still interested in any new tweaks or findings that might help improve the performance/UX of fast-paced discussions driven by real-world events.

1 Like

We just landed a big performance improvement yesterday for sites with active posting and many users; it should help a lot on your site.

https://github.com/discourse/discourse/commit/b1f32f2f5717c4f55b902485794e62b8cecd8522

9 Likes

Very good, we’ll have a look and potentially test this.

1 Like

Well, every game is an individual case. Now, with the COVID situation (empty arenas) and a near-random game schedule, the behavior of the audience is impossible to predict or compare to historical data.

Based on this single game, I can’t say that this change brought us significant improvement.

The 1st period was calm and fine, but events during the second caused a spike in messages and an increase in lurkers. About 60% of our people said they experienced freezes.

In the two-server setup, web_only is the only one reporting high CPU usage and load average.

The extreme load / read-only mode was not triggered, which is good, as it is the most painful UX. Overall, the audience has quickly learned to visit the index page and come back to resume the discussion, which generates more server load. If only the end user could somehow be informed that they are being throttled; then they would be more likely to actually wait a minute.

Progress report from the private conversations: the experience was improved by setting DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS to 4, and we are planning some core changes to improve the message bus rate-limiting behavior.
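
In config terms, that change is (same env section as the earlier tweaks; the 4-second value comes from the report above):

  ## let message-bus polls queue for up to 4 s before rejecting with a 429
  DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS: 4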

6 Likes

Since we experience something similar to @ljpp’s situation, although to a significantly lesser extent (almost exclusively in the ~5 minutes around the time the matches finish), I would like to know if there are any tweaks one can make to the threshold at which the high load message kicks in and users start being “kicked out” of the topic… because it’s always around one single topic, the match topic.

That, and the 502 error message (a purely nginx message) that we experience even more rarely in the same context. I suspect there are some nginx configs that might benefit from tuning, and I know it’s not your job, but I’m all ears for good suggestions :laughing:.

Please clarify: are you experiencing freezes (topic not updating with new posts), or are you getting extreme load error messages?

There are tweaks in this thread that provide some improvements for the freezing, but they also increase the system load, so you are more likely to get extreme load scenarios.

3 Likes

We sometimes experience topic freezing in the situations I’ve reported, but when that happens the system also shows extreme load warnings, so I can’t tell which is which.

We don’t mind extreme load as long as it doesn’t kick people out of topics or interrupt updates for new posts. In that case we would actually prefer it to load things slowly (the wheel could spin for 15 seconds for each user to read/post; we would prefer that to freezing or users being kicked out).

4 Likes

I have to agree. The extreme load UX is confusing for the end user.

  • How many concurrent users do you have?
  • What kind of hardware?
  • Link to your forum stats?

@sam

As we are now on the CDCK SaaS platform, I can only observe this from the UX point of view.

We have had some good heat in the games during the last couple of weeks. The “freezes” have pretty much disappeared with the platform change, but there is a fluctuation in the way the topic gets updated, which may still be confusing to some. But the audience has mostly (90%) stopped complaining and is focusing on the games, which is a good sign.

There is, however, a scenario which I can reproduce with fairly high (again, ~90%) confidence. The platform has occasional issues resuming the session when the game topic is in a background tab (Android) or under a locked screen. When I get back to the busy topic, usually due to an interesting event in the game, the topic view is sometimes not updated: I can see user avatars blinking at the bottom of the topic, but no posts are appearing. One needs to refresh the browser to fully recover.

The repro pattern is not the easiest, as you need:

  • A busy topic
  • Some good action in the game → more heat to the topic
  • Keep the topic under a locked screen or in a background browser tab.

3 Likes

We suffer from that too.

Another thing: when jumping to the first unread post, it can repeat this behaviour a few times (going to the same “unread post” repeatedly, although the first-unread position should have changed each time).

To exemplify:

  1. I jump to the first unread post
  2. scroll and read the 100 unread posts
  3. then go to another topic or homepage…
  4. after a minute or so, there are like 30 new unread posts, but when I click on the icon, I’m thrown once again to the position from step 1 (meaning 130 posts backwards, and not just the 30 new unread ones).

But, once more, it only happens in very, very busy topics, for a few minutes at the greatest peak of refreshing and posting, with every user in the same topic at the same time. Kind of annoying, but not a dealbreaker so far.

1 Like

I would consider that a success.

Can you provide a repro here on meta? Probably not, since it requires a large number of active users idling in the same topic at the same time?

My current thinking is we should build a live chat feature and instantiate it just-in-time, when you have…

  • lots of users

  • in the same topic

  • at the same time

  • then, and only then, instantiate a live chat box overlay and strongly push users into using that instead of replies, maybe even disabling the ability to reply to the topic with:

    :loudspeaker: Hey, it looks like what you really wanted was a chatroom… here it is, have fun! :speech_balloon:

14 Likes