Real-time updating of topics freezes under high activity

Note: I am not certain this is a bug in Discourse. I have tried to collect the necessary evidence, and so far I have not found anything pointing to our infra/setup. Our configuration at Tappara.co is as vanilla as possible.

Observed phenomenon:

  • Rapid, chat-like discussion topics stop updating automatically. After a delay of 30–180 seconds the updating usually resumes, revealing the posts that were made during the freeze.

What we know so far

  • We did not see this during the previous season, the last game was played in March.
  • We run stable branch and did the latest major update in August.
  • The issue was immediately reported in the first exhibition games, with moderate traffic/activity.
  • This impacts iOS and Android Chrome, but is far less frequent on Chromebook.
    • As I write this, I am seeing freezes on my Android phone, while the discussion flows as expected on my Chromebook. Two different devices in the same network.
  • The experience varies per user/client. Different users report the freezes at different times. We just recorded roughly 300 messages in about 30 minutes, and users reported dozens of freezes. The freezes mostly seem to correlate with events in the game (goals, penalties).

Things I have tried to rule out

  • CloudFlare – we did one game without CF caching, and the issue persisted.
  • CPU overload – CPU usage is well within limits, usually hovering around 20-30%.
  • Disk exhaustion – Disk I/O seems to be well within limits. We have UpCloud’s MaxIOPS SSDs.

Other info

  • I had the Chrome inspector running during the game and some 429s were recorded, but for me they did not correlate with the freezing.
  • End users are not seeing the notifications regarding 429s (“slow down”) or extreme load. The updating just freezes and then resumes. Has the rate limiter changed recently? I am under the impression that rate limits should trigger a notice in the UI.

A really nasty problem that really hurts the play-by-play game chats. We have been running these for years, and I have never seen this before.

5 Likes

Well, it’s not a bug but a feature :sweat_smile:.

When your web workers start to be overwhelmed by requests, we introduce delays in the persistent connection that auto-updates the topic when new posts arrive.

If you are seeing this together with low CPU usage, as reported, please increase the number of unicorn workers; that should solve this issue.
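For a standard Docker install that is typically a one-line change in the env section of containers/app.yml, along these lines (the worker count here is just an example – pick something that fits your cores and memory):

    env:
      ## number of unicorn web workers; more workers handle more
      ## concurrent requests at the cost of extra RAM
      UNICORN_WORKERS: 8

    ## then rebuild to apply: cd /var/discourse && ./launcher rebuild app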

7 Likes

We are already at 11 unicorns on a 6-core VPS. Like I said, this never occurred during the previous season, which ended in March. It occurs now even with moderate traffic. It also happens a lot more on a mobile device, especially Android Chrome, than it does on a desktop.

Also, we have managed to exhaust our CPU before (at the trade deadline).

I missed a game while monitoring and fiddling with the server. We doubled the web.ratelimited parameters, but that did not solve the issue.
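For context, by “web.ratelimited parameters” I mean the per-IP nginx limits in the params block of templates/web.ratelimited.template.yml, which our containers/app.yml includes. The names and numbers below are from memory and purely illustrative – we simply doubled what the template shipped with:

    ## templates/web.ratelimited.template.yml (illustrative values only)
    params:
      reqs_per_second: 24
      burst_per_second: 24
      reqs_per_minute: 400
      burst_per_minute: 200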

The inspector catches a number of 429s:

Request URL: https://tappara.co/message-bus/3ed86765a67f4c31ba4053a0352ecaf5/poll
Request Method: POST
Status Code: 429

The next game is tomorrow, so I can try the unicorns. How high can we go? Has this changed with the latest major update?

Edit:

I had a look at the stats, and our activity has so far been lower than what we saw in the spring (pageviews, users). The number of posts per game chat is identical (around 900–1000 per 3 hours).

So for some unknown reason, we are currently unable to serve the same audience we had in March.

3 Likes

I am working on an analysis of this issue over the next 2 weeks; it will take time to improve.

9 Likes

Great! Could you confirm whether there has been a recent (within 6 months) change or regression that could have caused this?

In the meantime, I think I’ll pump up the unicorns for tonight’s game and see what happens. If we can help in any way, just let me know.

1 Like

@falco The number of unicorns is definitely not the key. I increased them to 15 for tonight’s game. The game topic was calm, only 700 messages, but constant freezing was observed. CPU load was mild, between 5 and 25%.

The more I investigate this, the more it looks like a regression, but my skills are not good enough to identify where.

I think there is a bug in the client where it basically stops updating after it gets one error. My suspicion is that you are experiencing this because your users are getting rate limited.

I will be investigating making the client more robust this week; as I said earlier, this is fiddly code and will take time.

2 Likes

Great. In the meantime, I’ll test whether disabling the global rate limiter would be a workaround. We’ll know on Wednesday.
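In case anyone wants to replicate the test, my plan is to switch the global per-IP limiter off via the env section of containers/app.yml, roughly like this (setting name as I understand it from the global defaults – corrections welcome):

    env:
      ## per-IP rate limiting mode; as far as I know the default is 'block'
      ## and 'none' switches the global limiter off entirely
      DISCOURSE_MAX_REQS_PER_IP_MODE: none
      ## rebuild afterwards: cd /var/discourse && ./launcher rebuild app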

Debugging this issue led me to think about the UX of near real-time discussion in general. Many communities deal with real-life events, which naturally “push” the discussion towards a rapid, chat-like conversation. It can be the stock market, a major product launch event, or a game (esports or physical)… you name it.

But in this kind of chat-like discussion culture, the quality of the posts varies a lot. On the other hand, the posts have a natural tendency to arrive at exactly the same time. Let’s imagine that there is a big hockey game going on and someone scores a goal.

  • A large portion of the posts are just emotional reactions, cheers or woes.
    • “Goooool!!! Yeah baby”
  • Some are informative:
    • “Crosby scores, 1-0 for the Pens”
  • A small minority makes the effort to put in some analysis:
    • “Crosby scores a breakaway goal, after a careless forechecking effort by the Caps, but it looked a lot like offside play. The Caps coach should challenge this.”

Discourse being a fast (near real-time) platform, this means that even when things run smoothly, you will get a couple of dozen posts at virtually the same moment. For the reader, especially someone who is not watching the game but follows it in the chat topic, this creates a UX challenge – it is hard to spot the informative posts in the middle of the cheers and woes. In our forum’s game chats we often get the question “What’s the score?”, as chatters watching the game forget to post the obvious, or the information is lost in the flood of messages.

I am not sure how this would work in real life, but it would be interesting to test whether admins could set the pace of the discussion – say, one post per second. All posts would be queued but published on the site at a defined pace. If a goal generates 20 reaction posts, they would not appear in the topic at the same time, but over a time window of 20 seconds. Could that make it easier to follow and catch the relevant information?

This could of course lead to other problems: if the pace of new messages constantly exceeded the publishing pace, the queue would keep growing and the chat would start lagging behind the real world.
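To make the idea a bit more concrete, here is a rough, purely hypothetical sketch of the kind of paced publishing I have in mind – nothing Discourse-specific, just the mechanism:

    // Hypothetical sketch: new posts are queued and released at a fixed pace
    // (e.g. one per second) instead of all appearing at the same moment.
    interface QueuedPost {
      author: string;
      body: string;
      createdAt: number; // ms timestamp
    }

    class PacedPublisher {
      private queue: QueuedPost[] = [];

      constructor(
        private publish: (post: QueuedPost) => void, // pushes one post into the topic stream
        private paceMs = 1000                        // admin-defined pace: one post per second
      ) {
        setInterval(() => this.tick(), this.paceMs);
      }

      enqueue(post: QueuedPost) {
        this.queue.push(post);
      }

      private tick() {
        const next = this.queue.shift();
        if (next) this.publish(next);
        // Caveat from above: if posts arrive faster than one per paceMs,
        // the queue grows without bound and the chat lags behind the game.
      }
    }

A real implementation would probably need to drain faster, or flush entirely, whenever the backlog grows beyond a few seconds.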

Not sure if you got the idea, and even I am not sure the idea is any good. The bottom line is that the UX of real-time chatting is an interesting topic and might have potential for further development. I do understand that the primary focus of Discourse is not to be a chat platform – there is other software for that. But these discussions do happen, naturally.

1 Like

I like that idea, but it would need some kind of reverse shadow banning so that people always see their own posts immediately. If they don’t, they might double or even triple post, thinking the forum is not working.

1 Like

I just merged this:

It ensures that we do not take out a server if 1000 people are looking at one topic and posts are being made.

The client now behaves far more cleanly in these cases.

Anticipating @ljpp’s question here: I am still undecided on backporting, as this makes API changes and is a rather big change. If we do backport… it is probably a few weeks out. I need to observe this in production under load, and we get so few events like this – since we have so much breathing room on our hosting – that it will take a while to catch one.

8 Likes

Jedi mind tricks :wink:

  • We will test whether disabling the rate limiter is a feasible workaround.
  • If it is: the next stable release is not too far away, I presume.
  • If it is not: we’ll have a look at the beta channel. We will have to verify that our UI customizations do not break with the update.

Do we have any other communities with similar chat-like discussions that are running on the edge branches?

Expect one by the end of the year… so I would not expect it any time super soon. We will, though, cut another beta this week!

All of our hosting runs on beta… so yeah, but we have tons of capacity.

2 Likes

I understand the reasoning why this might not be a backport candidate. I have just disabled our rate limiter, and the next game is tomorrow, so we’ll get a rough idea of whether it serves as a feasible workaround for instances that are unwilling to go beta.

We are definitely considering rolling on the beta branch for the next couple of months. Though there are some other concerns as well – @rizka pointed out that the FI translation is lagging behind (but he might be able to work on it later this week).

1 Like

@sam

Unfortunately, disabling the rate limiter did not help at all. It was a boring game and 83 users posted only 580 messages. Several freezes were reported during the game.

Are there any potential hacks or workarounds to try while waiting for the upgrade to an edge release?

The “freezes” are a client bug – it simply did not react correctly to error conditions. Even one rate limit error and you are toast on stable.

I cannot think of a workaround short of updating to beta (we are cutting a new one tomorrow).
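Very roughly, the difference is in how the polling loop reacts to errors – the old client effectively behaves like the first loop below, and the fix moves it towards the second (a simplified sketch, not the actual message-bus code):

    // Simplified sketch – not the real client code.
    // Old behaviour: the loop dies on the first failed poll, so a single 429
    // freezes topic updates until something else eventually restarts polling.
    async function pollLoopOld(poll: () => Promise<void>) {
      while (true) {
        await poll(); // any rejection escapes the loop and updating stops
      }
    }

    // More robust behaviour: back off and retry instead of giving up.
    async function pollLoopRobust(poll: () => Promise<void>) {
      let backoffMs = 1_000;
      while (true) {
        try {
          await poll();
          backoffMs = 1_000;                                // success: reset the delay
        } catch {
          await new Promise(r => setTimeout(r, backoffMs)); // wait before retrying
          backoffMs = Math.min(backoffMs * 2, 60_000);      // exponential backoff, capped
        }
      }
    }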

1 Like

One of our dev-oriented members proposed adjusting the following variable. What do you think – do you see this as a potential workaround?

DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS: 0.2

We tried the hack:

DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS: 0.2

This reduced the number of observed slowdowns significantly, but did not solve the problem. CPU load increased, spiking to around 55% at the times of major events in the game.
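For anyone wanting to replicate this: we set it in the env section of containers/app.yml and rebuilt the container. The comment reflects our understanding of the setting, which may well be imperfect:

    env:
      ## Our understanding: message-bus poll requests are rejected (apparently
      ## with the 429s the inspector shows) once they have waited in the request
      ## queue longer than this many seconds. Raising it means fewer rejected
      ## polls under load, which would also explain the higher CPU usage.
      DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS: 0.2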

There have been recent changes to help with the “freezes”: the client will now back off and wait if the server is overloaded.

The ultimate answer may be to get a bigger server and run more Unicorn workers, though. See the discourse-setup script for our recommendations on server capacity to worker count.

1 Like

I doubt it… the issue under high reply load is that, prior to the new design, we could cause a flood that would trigger rate limits due to max_reqs_per_ip_per_10_seconds and such. You would need enormous resources to handle the load.

Consider:

  • 30 users post a reply within 10 seconds.
  • 100 people are looking at the topic.
  • The server needs to be able to handle 3000 GET requests, each asking for a single post at a time.
  • If ANY of these requests fails for any reason, the UI would freeze and appear broken.

The new design resolves this issue very cleanly: requests back off cleanly, they are batched if we get a backlog, the UI does not freeze, and so on.

I cannot see the old design scaling to 100 concurrent users and 30 replies in 10 seconds.

I can see the current revised design working fine with 1000 concurrent users looking at a topic with 30 replies in 10 seconds.
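To put numbers on it, the back-of-the-envelope math looks like this (nothing here is a real configured limit, it is just the arithmetic from the bullets above):

    // Rough load estimate for the old one-GET-per-new-post design.
    const repliesIn10s = 30;   // 30 users each post a reply within 10 seconds
    const viewers = 100;       // 100 people are watching the topic

    const getsPerViewer = repliesIn10s;          // 30 GETs per client in ~10 seconds
    const totalGets = getsPerViewer * viewers;   // 3000 GETs for the server in ~10 seconds

    console.log({ getsPerViewer, totalGets });
    // A 30-request burst per client, on top of regular message-bus polling,
    // is enough to start bumping into per-IP limits such as
    // max_reqs_per_ip_per_10_seconds – and any single failed GET froze the old UI.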

5 Likes