Real-time updating of topics freezes under high activity

Yesterday we landed a big performance improvement for sites with active posting and many users; it should help a lot on your site.

https://github.com/discourse/discourse/commit/b1f32f2f5717c4f55b902485794e62b8cecd8522

9 Likes

Very good, we’ll have a look and potentially test this.

1 Like

Well, every game is an individual case. Now, with the COVID situation (empty arena) and a near-random game schedule, the behavior of the audience is impossible to predict or compare to historical data.

Based on this single game, I can’t say that this change brought us significant improvement.

The 1st period was calm and fine, but events during the second caused a spike in messages and an increase in lurkers. About 60% of our people said they experienced freezes.

In the two-server setup, web_only is the only one reporting high CPU usage and load average.

The extreme load / read-only mode was not triggered, which is good, as it is the most painful UX. Overall the audience has quickly learned to visit the index page and come back to resume the discussion, which generates more server load. If only the end user could somehow be informed that he is being throttled. Then he would be more likely to actually wait a minute.

Progress report from the private conversations: the experience was improved by setting DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS to 4, and we are planning some core changes to improve the message bus rate-limiting behavior.
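For self-hosted installs, a setting like this is normally passed to the container as an environment variable. A minimal sketch, assuming the standard discourse_docker layout with a containers/app.yml file; the value 4 is the one reported above, not a general recommendation:

```yaml
# containers/app.yml (excerpt), assuming the standard discourse_docker layout
env:
  # Value reported in this thread; a higher threshold likely means fewer
  # rejected message bus polls but more load on the web workers
  DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS: 4
```

Changes to this file generally take effect only after the container is rebuilt.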

6 Likes

Since we experience some similarities with @ljpp’s situation, although to a significantly lesser extent (almost exclusively in the 5 minutes or so around the time when the matches finish), I would like to know if there are any tweaks one can make to the threshold where the high-load message kicks in and users start being “kicked out” of the topic… because it’s always around one single topic, the match topic.

That and the 502 error message (purely nginx message) that we experience even more rarely in the same context. I suspect there are probably some configs in nginx that might benefit from some tuning, and I know it’s not your job, but I’m all ears for good suggestions :laughing:.

Please clarify - are you experiencing freezes (topic not updated for new posts) or are you getting extreme load error messages?

There are tweaks in this thread that provide some improvements for the freezing, but they also increase the system load, so you are more likely to get extreme load scenarios.

3 Likes

We sometimes experience topic freezing in the situations I’ve reported, but when that happens the system also shows warnings of extreme load. So I can’t tell you which is which.

We don’t mind extreme load as long as it doesn’t kick people from topics or interrupt update for new posts. We would actually prefer in that case to have it slowly loading stuff (the wheel could spin for 15 seconds for each user to read/post and we would prefer that to freezing or user being kicked out).

4 Likes

I have to agree. The extreme load UX is confusing for the end user.

  • How many concurrent users do you have?
  • What kind of hardware?
  • Link to your forum stats?

@sam

As we are now on the CDCK SaaS platform, I can only observe this from the UX point of view.

We have had some good heat in the games during the last couple of weeks. The “freezes” have pretty much disappeared with the platform change, but there is some fluctuation in the way the topic gets updated, which may still be confusing to some. But the audience has mostly (90%) stopped complaining and is focusing on the games, which is a good sign.

There is, however, a scenario which I can reproduce with fairly high (again 90%) confidence. The platform has occasional issues resuming the session when the game topic is in a background tab (Android) or under a locked screen. When I get back to the busy topic, usually due to an interesting event in the game, the topic view is sometimes not updated. I can see user avatars blinking at the bottom of the topic, but no posts are appearing. One needs to refresh the browser to fully recover.

The repro pattern is not the easiest, as you need:

  • A busy topic
  • Some good action in the game → more heat to the topic
  • Keep the topic under a locked screen or in a background browser tab.

3 Likes

We suffer from that too.

Another thing is, when jumping to the first unread post, it can repeat this behaviour a few times (going to the same “unread post” a few times, although the first unread post position should have changed in each occasion).

To exemplify:

  1. I jump to the first unread post
  2. scroll and read the 100 unread posts
  3. then go to another topic or homepage…
  4. after a minute or so, there are like 30 new unread posts, but when I click on the icon, I’m thrown once again to the position from step 1 (meaning 130 posts backwards and not just the 30 new unread ones).

But, once more, it only happens in very very busy topics during some minutes at the greatest peak of refresh and posting by every user all in the same topic at the same time. Kind of annoying but not a dealbreaker so far.

1 Like

I would consider that a success.

Can you provide a repro here on meta? Probably not since it requires a large number of active users idling in the same topic at the same time?

My current thinking is we should build a live chat feature and instantiate it just-in-time, when you have…

  • lots of users

  • in the same topic

  • at the same time

  • then, and only then, instantiate a live chat box overlay and strongly push users into using that instead of replies, maybe even disable the ability to reply to the topic with

    :loudspeaker: Hey, it looks like what you really wanted was a chatroom… here it is, have fun! :speech_balloon:

14 Likes

Yeah, I know what you mean, but it’s so limited to those occasions that I guess it’s not worth the effort. We usually have matches like that once or twice a week, and it’s mostly in the 5-minute period right after the match finishes. But I’ve actually thought about it several times (that it would be nice to have a temporary chatroom function, or something to switch to, for the 90-minute period of a football match). :laughing:

Still, I’ll try to repro one of these days by recording the screen for a while.

1 Like

Our instance has been showing some 429s, as the playoff games have started. @staff should be able to see some in the last 3.5 hours of our logs, and more are expected when the deciding goal is scored (the game is going to second OT as I type this).

In any case, if you are still logging and tracing this, there are not many opportunities left, as the finals and the following off-season are getting close.

2 Likes

I just wanted to add my name to the thread here so I can follow this. We are a new gymnastics forum. We experienced the above along with “freezing” last night during US Olympic Trials. Here is the thread…

We had 4 unicorns last night.

I resized the server to 4 Intel vCPUs & 8 GB memory at Digital Ocean and did…

unicorn_workers: 8
db_shared_buffers: "2GB"

We are expecting much higher traffic during the Olympics. What else can we do to optimize the server for “chat like” traffic during the competition?
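For reference, in the stock standalone template shipped with discourse_docker these two settings usually sit in different sections of containers/app.yml: the Postgres buffer size under params, and the worker count as an upper-case variable under env. A minimal sketch, assuming that standard layout and reusing the values quoted above (your own file may differ):

```yaml
# containers/app.yml (excerpt), assuming the standard discourse_docker layout
params:
  # The template's own comment suggests a maximum of 25% of total memory
  # (2GB of the 8GB mentioned above)
  db_shared_buffers: "2GB"

env:
  # Number of unicorn web workers; value taken from the post above
  UNICORN_WORKERS: 8
```

Changes to this file generally take effect only after the container is rebuilt.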

3 Likes

If you have hundreds of users in a single topic using Discourse as a chat and it’s a limited time event, I’d suggest bumping the server temporarily a bit more.

The larger Premium AMD droplet in Digital Ocean for the 16 days of the Olympics cost $54.85, and should be more than enough for a community of your size.

7 Likes

I do not have these lines in my app.yml. Do I just add them?

Yes. Add them in the env section.

1 Like

If this is still on the staff’s radar, our blast off is tonight at 18:30 (UTC+3) and again tomorrow at the same time.

There is much anticipation after two COVID ruined seasons, so I am expecting heavy traffic spikes at tappara.co

1 Like

@ljpp
What is your current situation? Did Redis 6 help you?

We are now on CDCK SaaS, which is why I gave the staff a heads up. We are kind of a test bench for this matter.

3 Likes