Improving Instance Performance (Megatopics, Database Size and Extreme Load)

Sorry, you are right: that would free them both up and give better performance than running everything on a single machine (I was comparing against the resources I’m using on a single machine and the tweaks I’ve been doing so far).

That is actually a very good idea. Let me try to solve the “unable to rebuild the data container” issue I have, and that will be my next jump in this journey.

1 Like

I have been searching high and low on this topic, but could not find documentation on how to do this properly. Does such a guide exist?

We are also starting to hit a wall with our standard single-VPS installation. Our rather unique dilemma is the game chats that take place during hockey games and cause sharp spikes in activity/load, especially if something extraordinary happens in the game.

You would need something powerful enough to withstand your busiest moments, I guess, or a way to increase performance during those times. Maybe look for a VPS you can pay for by the hour. One solution (continuing the earlier tip) would be to move the web container to an extremely powerful VPS that you pay for only during the few hours when there are games.

You need to:

  1. Run PostgreSQL elsewhere (on a droplet, or use a hosted service like https://www.digitalocean.com/products/managed-databases/), and move your data there.

  2. Follow Running Discourse with a separate PostgreSQL server (a rough sketch of the relevant settings follows this list).
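For orientation, here is a minimal sketch of what the database-related env entries in containers/web_only.yml could look like once PostgreSQL runs on another machine, assuming the standard two-container layout. The host names and credentials are placeholders; the linked howto remains the authoritative reference.

```yaml
# Sketch only: web_only container pointing at a PostgreSQL server
# (and optionally Redis) that lives on a separate machine.
# Host names, user and password are placeholders.
env:
  DISCOURSE_DB_HOST: db.example.com        # external PostgreSQL host
  DISCOURSE_DB_NAME: discourse
  DISCOURSE_DB_USERNAME: discourse
  DISCOURSE_DB_PASSWORD: "change-me"
  # Only needed if Redis also moves off the web machine:
  DISCOURSE_REDIS_HOST: redis.example.com
```

After editing the file you would rebuild the web container with ./launcher rebuild web_only.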

2 Likes

And this can also be achieved with Discourse’s containerized setup, i.e. the web_only and data containers, right?

In my experience this is not directly solved by any current approach, nor does it have a straightforward solution. In fact, separating the containers onto different machines is not an instant fix for that issue.

We also experience heavy drops and “the site is extremely busy so you are seeing it as someone that isn’t logged in” messages when a big event happens (such as a game, like @ljpp said), and that drags down the whole site, affecting not only the people inside that topic.

So I tried two different things, a separated setup and a “big machine”, and both have this type of issue. My instances are monitored with Prometheus and the logs are visible in Grafana, etc., so I have very granular insight into hardware/container performance, and I can confirm that it really doesn’t matter what you do: the issue happens anyway.

If you put a big machine behind it you may delay the problem a little, but you will still get the errors and session drops while the machine sits at almost no usage, be it disk, CPU or RAM. And this happens with both the “default install” and the “two container” install.

With different machines the issue is the same, regardless of whether both machines are the same type or one is “CPU-Optimized” and the other “Disk-Optimized”, etc. On top of that you add an extra layer of possible failure: the connection between the two machines, which will inevitably add some latency. How much latency depends on how you set up that connection and how far apart the two machines are, but you will still see the same behavior.

As a note, this type of behavior also happens with things like the Babble plugin; however, it seems to me that Babble can handle a lot more “simultaneous” writes, even though its “chats” are actually hidden topics, and the difference is in how they are presented and “refreshed”/“pulled”. This difference in behavior has led me to conclude that this is some application-level correlation deriving from a front-end kind of issue “crashing” the app (front end not being my area of expertise, contrary to back end) combined with the operations at hand: people posting and people staying on a topic waiting for it to “self-update” with tens of messages in a single minute.

To that you also have to add the human factor: when people feel the site is “sluggish” or that a topic “isn’t updating as fast as it should”, they will F5 the hell out of it, adding even more load. But good luck “educating” them in that regard :stuck_out_tongue:

1 Like

Thank you. You most likely saved me many hours of debugging, trial and error.

This difference in behavior has led me to conclude that this is some application-level correlation deriving from a front-end kind of issue “crashing” the app (front end not being my area of expertise, contrary to back end) combined with the operations at hand: people posting and people staying on a topic waiting for it to “self-update” with tens of messages in a single minute.

To summarize this mega-sentence: Your conclusion is that the “choking” of rapid real-time discussion is a front-end issue?

I did not get far in our analysis, but I did observe that the CPU load was nowhere near the maximum, only around 25% or so. Over the years we have hit 100% CPU many times before, but that was not the case in the latest incident last Saturday. We only had about 150 concurrent users.

What led me to suspect the backend is the fact that we have been running these game chats for years, and I have not seen this “choking” before. Over the years the database has grown, and we are now at over 800,000 posts.

When was the first time you saw this? Could this be a regression in the latest major update? Our site and its performance were fine in the spring, and we have not grown that much over the summer.

1 Like

Babble is a spectacularly inefficient plugin, so if you are running with Babble you are going to have a bad time … especially with lots of simultaneously active users.

1 Like

We have a TODO on our backlog now to improve performance for cases where 1000 people are looking at the same topic and one person posts.

As designed today, we publish to all 1000 people “Hi, there is a new post for this topic”, then all 1000 head to the server (mostly at the same time) asking … “hey, what is this new post you are talking about?”

We are going to look at, minimally, rate limiting this in a cleaner way so clients don’t hammer the server, but there are a bunch of optimisations we can make for this outlier case.
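To make the “thundering herd” described above concrete, here is a purely illustrative TypeScript sketch. It is not Discourse’s actual message_bus or client code; the route, query parameter and jitter window are assumptions. It only shows the general mitigation of spreading client fetches with random jitter and backing off on 429 responses instead of every viewer hitting the server in the same instant.

```typescript
// Illustrative only: spread out the "fetch the new post" requests that
// follow a broadcast notification, so 1000 viewers do not all hit the
// server at the same moment.

type NewPostNotification = { topicId: number; postNumber: number };

// The route and "after" parameter are placeholders, not a documented
// Discourse endpoint.
async function fetchNewPosts(note: NewPostNotification): Promise<void> {
  const res = await fetch(`/t/${note.topicId}/posts.json?after=${note.postNumber}`);
  if (res.status === 429) {
    // Rate limited: honour Retry-After instead of retrying immediately.
    const retryAfter = Number(res.headers.get("Retry-After") ?? "5");
    setTimeout(() => void fetchNewPosts(note), retryAfter * 1000);
    return;
  }
  const data = await res.json();
  // ...append the returned posts to the topic stream...
  console.log("new posts", data);
}

// On a "there is a new post" notification, wait a random delay (jitter)
// before fetching, turning one synchronized spike into a short ramp.
export function onNewPostNotification(
  note: NewPostNotification,
  jitterMs = 3000
): void {
  const delay = Math.random() * jitterMs;
  setTimeout(() => void fetchNewPosts(note), delay);
}
```

The point is simply that a random delay turns one spike of 1000 simultaneous requests into a short ramp the server can absorb; the actual optimisation Discourse chooses may look completely different.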

7 Likes

Great news that this is on your radar. Can you confirm if this has regressed since last spring?

Our local hockey league tries to start the games on October 1st. This means that our site can provide live traffic spikes on a weekly basis, in case you need/want to study the behavior as it happens in a true (non-simulated) environment.

DM if interested. We are happy to support.

From a UX point of view, the end user should somehow know that the discussion is active even when the system starts choking. This might prevent unnecessary browser refreshes.

No, sadly I cannot confirm that; this has always been the case.

3 Likes

The first game has just ended and we definitely have a problem that we did not have in March. Reason still unknown.

The game chat chokes on the frontend, while the server load should be far below 100%.

One of our users spotted a number of 429 server responses during the chokes, but I cannot say whether this is “normal”, as we have not done this kind of inspection during games before.
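For what it is worth, in a standard discourse_docker install 429 responses usually point to Discourse’s built-in rate limiting (the web.ratelimited template plus the per-IP request limits). That can plausibly trigger when hundreds of clients on the same topic poll at once, or when an extra reverse proxy in front hides the real client IPs so every request appears to come from one address. A hedged sketch of where those knobs live in containers/app.yml or web_only.yml; the values shown are illustrative, not recommendations, and names/defaults may differ by version:

```yaml
# Sketch of the rate-limit related pieces of a discourse_docker config.
templates:
  - "templates/web.ratelimited.template.yml"

env:
  # "block" returns 429 once the per-IP budget is exceeded;
  # "warn" only logs, "warn+block" does both, "none" disables.
  DISCOURSE_MAX_REQS_PER_IP_MODE: warn+block
  DISCOURSE_MAX_REQS_PER_IP_PER_MINUTE: 200
  DISCOURSE_MAX_REQS_PER_IP_PER_10_SECONDS: 50
```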

Have you @iceman seen this on your site?

I have seen one of those while investigating a totally unrelated ‘500’ error :sweat_smile:
The server wasn’t busy at all, but I was messing with the config of an nginx running in front at the time (http2).

1 Like

By “game chat” do you mean “lots of users active on the same topic simultaneously”?

Indeed. There were around 900 rapid-fire replies within a timespan of a couple of hours. @ljpp has more exact user numbers, but we are talking about hundreds of users browsing the topic at any given time during the match.

Weirdly enough, this doesn’t affect every user. I, for instance, haven’t encountered any problems on any device. But according to reports it is widespread enough.

It is not so obvious to spot, especially if you are not paying close attention.

First there is a break of 30-60 seconds with no replies. Nothing seems “wrong”, it’s just quiet. You can even write your own post. Then suddenly you get dozens of messages in a flash and you realize that you have been lagging behind. I have seen this on iOS Safari and Android Chrome.

Our real-time game chats are busy, but not extreme cases. Yesterday we had 972 messages over the course of ~3 hours.

Otteluseuranta: Lukko - Tappara 2.10.2020 - Liveseuranta - Tappara.co (match tracker topic, in Finnish)

The next game chat will take place today at 14:00 UTC. Due to the pandemic, I expect similar numbers, even though it is a home game.

I agree with @pfaffman’s post about this.

Aren’t you trying to force a chat use case onto a forum platform?

Why don’t you instead integrate a chat service like Mattermost or Discord into the UI of your Discourse site and have this medium cover in-game discussion?

You could find some other way to cover the game in a forum topic, such as pre-game or post-game discussion, where the load is less demanding but the content may include useful summary information that many users might want to retrieve at a later date.

I also don’t see the benefit of a huge volume of off-the-cuff chat being stored on a forum. Is anyone going to read it again? Is it useful?

2 Likes

Well, he uses the word “chat” for this, but according to the user his Discourse setup can’t handle “972 posts over the course of ~3 hours” in a topic … it should, IMHO; even a simple phpBB can handle several times that volume in 3 hours.

1 Like

So 1 post every 10 seconds? On its own that doesn’t sound unreasonable. But then you make the Topic 1000 posts long and have several hundred users taking part and on top of that you get spikes of posts. I can see the challenge!

1 Like

But what is the real culprit/bottleneck here? The number of users taking part, the 1 post per 10 seconds, the rendering of the changed content for (too) many anonymous/non-anonymous users, or the number of connections required to serve so many logged-in users?

Would he get the same problem if just 2 users produced the same number of posts in a topic in the same time frame?

Even with 972 logged-in users each making just one post in that topic, is Discourse not able to handle it? And if so, why? Is Discourse only the right choice for very small communities with a low number of simultaneously logged-in users?

I’m just wondering, as we already have up to 400 simultaneously logged-in users at times, producing up to 3,000 posts per day… so far, no problems.