Improving Instance Performance (Megatopics, Database Size and Extreme Load)

Thank you. You most likely saved me many hours of debugging, trial and error.

This difference in behavior has led me to conclude that this is an application-level issue on the frontend (frontend is not my area of expertise, unlike backend): the app “chokes” under the combination of rapid posting and people sitting on a topic waiting for it to “self update”, with tens of messages arriving in a single minute.

To summarize this mega-sentence: Your conclusion is that the “choking” of rapid real-time discussion is a front-end issue?

I did not get far in our analysis, but I did observe that the CPU load was nowhere near the maximum, only around 25% or so. Over the years we have hit 100% CPU many times before, but that was not the case in the latest incident last Saturday. We only had around 150 concurrent users.

What led me to suspect the backend is the fact that we have been running these game chats for years, and I have not seen this “choking” before. Over the years the database has grown, and we are now at over 800,000 posts.

When was the first time you saw this? Could this be a regression from the latest major update? Our site’s performance was fine in the spring, and we have not grown that much over the summer.

1 Like

Babel is a spectacularly inefficient plugin, so if you are running with Babel, you are going to have a bad time … especially under a heavy load of simultaneously active users.

1 Like

We have a TODO on our backlog now to improve performance for cases where 1000 people are looking at the same topic and one person posts.

As designed today, we publish to all 1,000 people “Hi, there is a new post in this topic”, and then all 1,000 head to the server (mostly at the same time) asking … “hey, what is this new post you have for me?”

At a minimum we are going to look at rate limiting this in a cleaner way so clients don’t hammer the server, but there are a bunch of optimisations we can make for this outlier case.
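For illustration only, here is a rough sketch of the kind of client-side jitter that could spread that burst out. None of the names or the endpoint below are real Discourse APIs; they just stand in for “got the new-post notification” and “fetch the posts I haven’t seen yet”:

```ts
// Hypothetical sketch only; not the actual Discourse message-bus client.
// onNewPostNotification stands in for the message-bus callback, and the
// /t/:id/posts.json URL is a placeholder for whatever endpoint the client polls.

const MAX_JITTER_MS = 5_000; // spread the "everyone fetches at once" burst over ~5 s

function onNewPostNotification(topicId: number, lastSeenPostNumber: number): void {
  // Instead of fetching immediately, each client waits a random delay first,
  // so a topic with 1,000 watchers does not fire 1,000 simultaneous requests.
  const delay = Math.random() * MAX_JITTER_MS;
  setTimeout(() => void fetchNewPosts(topicId, lastSeenPostNumber), delay);
}

async function fetchNewPosts(topicId: number, lastSeenPostNumber: number): Promise<void> {
  const res = await fetch(`/t/${topicId}/posts.json?after=${lastSeenPostNumber}`);
  if (res.status === 429) {
    // Rate limited: back off with a longer, randomized delay before retrying.
    const backoff = MAX_JITTER_MS + Math.random() * MAX_JITTER_MS;
    setTimeout(() => void fetchNewPosts(topicId, lastSeenPostNumber), backoff);
    return;
  }
  const newPosts = await res.json();
  // ...hand newPosts to the rendering layer so the topic stream updates...
}
```

The idea is simply that 1,000 watchers arrive over several seconds instead of all at once, and that a 429 triggers a randomized retry rather than an immediate re-poll.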

7 Likes

Great news that this is on your radar. Can you confirm if this has regressed since last spring?

Our local hockey league aims to start its season on October 1st. This means that our site can provide live traffic spikes on a weekly basis, in case you need or want to study the behavior as it happens in a true (non-simulated) environment.

DM if interested. We are happy to support.

From a UX point of view, the end user should somehow know that the discussion is active, even when the system starts choking. This might prevent unnecessary browser refreshes.

No, sadly I cannot confirm that; this has always been the case.

3 Likes

The first game has just ended and we definitely have a problem that we did not have in March. Reason still unknown.

The game chat chokes on the frontend, while the server load stays far below 100%.

One of our users spotted a number of 429 server responses during the chokes, but I cannot say whether this is “normal”, as we have not done this kind of inspection during games before.

Have you @iceman seen this on your site?

I have seen one of those while investigating a totally unrelated ‘500’ :sweat_smile:
The server wasn’t busy at all, but I was messing with a front nginx config (http2) at the time.

1 Like

By “game chat” do you mean “lots of users active on the same topic simultaneously”?

Indeed. There were around 900 replies reacting to the game in a timespan of a couple of hours. @ljpp has more exact user numbers, but we are talking about hundreds of users browsing the topic at any given time during the match.

Weirdly enough, this doesn’t affect every user. I, for instance, haven’t encountered any problems on any device. But it is widespread enough, according to the reports.

It is not so obvious to spot, especially if you are not paying close attention.

First there is a break of 30-60 seconds with no replies. Nothing seems “wrong”, it’s just quiet. You can even write your own post. Then suddenly you get dozens of messages in a flash and you realize that you have been lagging behind. I have seen this on iOS Safari and Android Chrome.

Our real time game chats are busy, but not extreme cases. Yesterday we had 972 messages over the course of ~3 hours.

Match coverage: Lukko - Tappara 2.10.2020 - Live coverage - Tappara.co

The next game chat will take place today at 14:00 UTC. Due to the pandemic, I expect similar numbers, even though it is a home game.

I agree with @pfaffman’s post about this.

Aren’t you trying to force a chat use case onto a forum platform?

Why don’t you instead integrate a chat service like Mattermost or Discord into the UI of your Discourse site and have this medium cover in-game discussion?

You could instead cover the game in a forum topic as pre-game and post-game discussion, where the usage load may be less demanding but which might contain useful summary information that many users would like to retrieve at a later date.

I also don’t see the benefit of a huge volume of off-the-cuff chat being stored on a forum. Is anyone going to read that again? Is it useful?

2 Likes

Well, he uses the word “chat” for this, but according to the user his Discourse setup can’t handle “972 posts over the course of ~3 hours” in a topic … it should, IMHO; even a simple phpBB can handle several times that volume in 3 hours.

1 Like

So 1 post every 10 seconds? On its own that doesn’t sound unreasonable. But then you make the Topic 1000 posts long and have several hundred users taking part and on top of that you get spikes of posts. I can see the challenge!

1 Like

But what is the real culprit or bottleneck here? The number of users taking part, the one post every 10 seconds, the rendering of the changed content for (too) many logged-in and anonymous users, or the number of connections required to serve so many logged-in users?

Would he get the same problem if just 2 users produced the same number of posts in a topic in the same time frame?

Or with 972 logged-in users each making just one post in that topic, is Discourse not able to handle this? And if so, why? Is Discourse then only the right choice for very small communities with a low number of simultaneously logged-in users?

I’m just wondering, as we already have up to 400 simultaneously logged-in users at times, producing up to 3,000 posts per day… so far no problems.

You clearly need to take into account the server specs and the number of unicorn workers running; otherwise these stress-scenario results are less meaningful.

Blenderartists (@bartv), I believe, have a 64GB server and around a dozen unicorns running? Quite a monster :slight_smile:
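For anyone comparing setups: on a standard Docker install the unicorn worker count is set in the `env` section of `app.yml` (I believe recent installs auto-detect it from the CPU count unless you override it). A minimal sketch, with an illustrative value only:

```yaml
## containers/app.yml
env:
  ## Number of unicorn web workers serving requests; scale with CPU cores and RAM.
  UNICORN_WORKERS: 8   # illustrative, not a recommendation
```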

2 Likes

Absolutely, we’re just on 8 GB / 4 vCPUs with DO currently and not having any problems at the moment. So if the solution to this is simply to throw resources (RAM and CPU) at it, that’s fine with me. At our peak since relaunching with Discourse we had about 2,000 concurrent visitors on one post that went viral; the load was just above 1 and the CPU was at 60-70%. On an average day, with approx. 200-250 concurrent logged-in users, the CPU is bored at around 15-20%.

1 Like

Correct :slight_smile: Happy to share data, although we don’t use our Discourse as a chat and never see such spikes.

1 Like

You could make this argument, but I really dislike the idea of fragmenting conversations across two platforms. Real-time responsiveness is actually one of the killer features of Discourse that end users appreciate. Our game-time conversations are a big hit and an important part of the community culture.

Note that we have been running these for four years and this is a new problem we are facing. So hundreds of games have gone nicely, without “forcing” anything.

One of our educated members had a theory: could we be hitting Discourse’s global rate limits, and maybe CloudFlare has an impact?

DISCOURSE_MAX_REQS_PER_IP_PER_MINUTE : number of requests per IP per minute (default is 200)

DISCOURSE_MAX_REQS_PER_IP_PER_10_SECONDS : number of requests per IP per 10 seconds (default is 50)
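For reference, on a standard Docker install these can be overridden in the `env` section of `app.yml` and take effect after `./launcher rebuild app`. A minimal sketch, with illustrative values only:

```yaml
## containers/app.yml -- illustrative values, not recommendations
env:
  DISCOURSE_MAX_REQS_PER_IP_PER_MINUTE: 400        # default 200
  DISCOURSE_MAX_REQS_PER_IP_PER_10_SECONDS: 100    # default 50
```

One thing worth double-checking in a CloudFlare setup is that real client IPs are actually passed through to Discourse (this is what the CloudFlare template in discourse_docker is for); if they are not, many visitors can appear to share a handful of proxy IPs and trip these per-IP limits much sooner.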

Next game is in 1 hour and we’ll try that with CF cache disabled.

Note that we have also been using CloudFlare for 4 years, even though it is not generally recommended here. There have been only one or two issues along the way, Brotli being one of them and the outdated template another.

1 Like

Nope, CF is not the root cause. The issue reproduced with the cache disabled, and 429s were reported.

Ideas, anyone?

Yes, I understand your concern here. I would not like to lose the record of some interesting insight to chat either. It’s a tricky dilemma.