Improving Instance Performance (Megatopics, Database Size and Extreme Load)

Agree, what is the value of content that nobody reads? And which people are gonna actually sit down and read 10,000 posts from start to finish? :crazy_face:

It’s OK for some topics to be ephemeral chats which disappear entirely over time, what we’ve previously called “fast lane” versus “slow lane” traffic.

5 Likes

Small Update:

Since we reported this and agreed on closing everything above the mark (and re-establishing the limits), performance seems to be better, a lot better. We still get some “Due to extreme load, this is temporarily being shown to everyone as a logged out user would see it” messages, but considerably fewer.

With that said:

  • Thanks for the hard work looking into how to reduce that monster table, really appreciate it.
  • Are there any other “Performance Tips” or even “Setups” that we may be missing? Even if they are “advanced”, any help is appreciated.

Again, a huge thanks to everyone for helping and giving feedback.

3 Likes

Postgres 12 will help as it reduces table and index size by 20% in @falco’s testing.

4 Likes

Is there a target date for that yet?

1 Like

I started doing benchmarks today.

Gotta let it run for a while here on Meta so we can watch for performance regressions before rolling it out everywhere.

6 Likes

I’m sorry for bringing this one up again, but this is a different issue from the PostgreSQL migration (which I’m not able to do yet due to size).

Since the last rebuild my database size won’t stop increasing:

[image: database size graph]

The last rebuild was on May 17th, and since then the DB size won’t stop increasing, now reaching 57.3 GB, with most of that in the post_timings table.

My main issue with this is that I’m trying to do the PostgreSQL update (which will reduce index size by 20% but won’t solve this in the long run). From the Staff comments here, that size is not the norm, so it will keep increasing and become nightmarishly expensive. The more time passes, the more it grows, creating a loop that becomes hell to maintain. So my main question stands: is there a way to tackle this post_timings thing? Something to delete?

Can I compact tables or something?
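Not an answer on its own, but to double-check where the space is actually going (table data versus index bloat), a standard Postgres query works; the query below is a generic sketch, assuming you can reach the `discourse` database from a psql shell inside the app container:

```sql
-- Run inside the app container's Postgres (e.g. ./launcher enter app,
-- then: su postgres -c 'psql discourse'); read-only, safe on a live site.
SELECT relname                                       AS table_name,
       pg_size_pretty(pg_relation_size(relid))       AS table_data,
       pg_size_pretty(pg_indexes_size(relid))        AS indexes,
       pg_size_pretty(pg_total_relation_size(relid)) AS total
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;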

Thanks to everyone for their help.

1 Like

There is no way around it at this time. If your forum is really big, it will have a big timings table.

Meta is a really old instance, but it is mid-sized and has a small 4 GB post_timings table. On the other hand, we host one instance that is less than 2 years old whose post_timings table should be over 100 GB by now.

Hosting big forums will cost you more, and there are no ways around it today.

Maybe move your database to a standalone $20 droplet (80 GB disk) and put the web app on another $10 droplet? $30 in monthly costs for what appears to be quite a big community sounds reasonable.
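To see why the table scales this way: post_timings stores roughly one row per (user, post) pair actually read, so growth tracks readership, not just post count. A back-of-envelope estimate, where the per-row byte cost is an assumed illustrative figure (real numbers depend on your schema, indexes, and bloat):

```python
def estimate_post_timings_bytes(readers_per_post, total_posts, bytes_per_row=100):
    """Rough post_timings size estimate: one row per (user, post) read.

    bytes_per_row is an assumption covering tuple overhead plus index
    entries; measure your own instance for real figures.
    """
    rows = readers_per_post * total_posts
    return rows * bytes_per_row

# e.g. 500 readers per post across 1,000,000 posts at ~100 bytes/row
size = estimate_post_timings_bytes(500, 1_000_000)
print(f"{size / 10**9:.1f} GB")  # → 50.0 GB
```

The point of the arithmetic: an active forum where each post is read by hundreds of users accumulates hundreds of millions of rows regardless of how the servers are arranged.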

4 Likes

Thank you very much for your help, @Falco.

Welp, yeah, as I said, I was just asking because there’s no magic when it comes to space. I’m looking into splitting the setup, but then the app server may be too slow perf-wise (long story; I’m taking notes on the tests and will post them here later for everyone else to use if they find them useful).

I did the test I asked you about, restoring from a backup, and I think this can be a good situation to leverage: what I can immediately see is about 30% less disk usage (I’m still running tests to check that nothing is missing). There is a small issue with this approach, though, so even if the immediate benefits are great, it will generate some problems, mostly because I don’t know whether stored images are merely cached or not working at all (and yes, the backup includes them).

I’m taking notes so that I can update the OP; my idea is to add a small series of notes for people who worry about performance, covering all the things I’ve been modifying so far.

Is applying an “auto-delete replies” timer a good solution to this, technically speaking?

1 Like

Actually this is a pretty good idea and solves the usability problem (because, as we said, no one is going to read 10K messages). So the big question is whether that would be taxing on the server and database.

3 Likes

I’m not sure whether a 9000-reply topic with ~8600 of the replies deleted is good for perf, but somehow I doubt it. What say you, @eviltrout?

1 Like

I thought that “hidden” posts were completely removed from the database after some time, but now I see that is probably not the case.
So performance-wise my idea can’t solve the problem.

Is there any way to “purge” this data?

Deleted posts are soft deleted. However, they are often indexed, so you might notice some improvement when a bunch are removed. I wouldn’t recommend it, though. If there’s any way to move to a new topic once one gets large, that might serve you well.

3 Likes

What do you mean by this? Having the database and the web app on separate servers should give you more performance: while there will be some latency between servers, your Unicorn and Postmaster won’t have to compete for CPU and RAM.

Make sure the servers are in the same DO region :stuck_out_tongue:

3 Likes

Sorry, you are right, that would free them both and get better performance than running everything on a single one (I was comparing against the resources I’m using on a single machine and the tweaks I’ve been doing so far).

That is actually a very good idea; let me try to solve the “unable to rebuild the data container” issue I have, and that will be my next jump in this journey.

1 Like

I have been searching my head off around this topic, but could not find documentation on how to do this properly. Does such a guide exist?

We are also starting to hit a wall with our standard single-VPS installation. Our rather unique dilemma is the game chats, which take place during hockey games and cause sharp spikes in activity/load, especially if something extraordinary happens in the game.

You would need something powerful enough to withstand your busiest moments, I guess. Or you would need to increase performance during those times. Maybe look for a VPS you can pay for per hour. One solution (continuing the previous tip) would be to move the web container to an extremely powerful VPS that you pay for just for a few hours when there are games.

You need to:

  1. Run PostgreSQL elsewhere (on a droplet, or use a hosted service like https://www.digitalocean.com/products/managed-databases/), and move your data there.

  2. Follow Running Discourse with a separate PostgreSQL server
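For step 2, the change boils down to pointing the web container at the external database. A sketch of the relevant `env` section of `containers/web_only.yml`; the host, name, and credentials below are placeholders, not real values, so check the linked guide for the authoritative list of settings:

```yaml
env:
  DISCOURSE_DB_HOST: 10.0.0.2        # placeholder: private IP of the DB droplet
  DISCOURSE_DB_PORT: 5432
  DISCOURSE_DB_NAME: discourse
  DISCOURSE_DB_USERNAME: discourse
  DISCOURSE_DB_PASSWORD: change-me   # placeholder credential
```

Using the droplets’ private network addresses (rather than public IPs) keeps latency down and avoids exposing Postgres to the internet.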

2 Likes

And this can also be achieved using Discourse’s containerized setup? web_only and data containers, right?

From my experience, this is not directly solved by any current approach, nor does it have a linear solution. In fact, separating them onto different machines is not an instant fix for that issue.

We also experience heavy drops and “the site is extremely busy so you are seeing it as someone that isn’t logged in” messages when a big event happens (such as a game, like @ljpp said), and that drags down the whole site, not only the people inside that topic.

So, I tried two different things, a separated setup and a “big machine”, and both have this type of issue. My instances are monitored with Prometheus, with the logs visible in Grafana, etc., so I have very granular insight into hardware/container performance, and I can confirm that it really doesn’t matter what you do: the issue happens anyway.

If you put a big machine behind it you may delay it a little, but you will still get the errors and session drops while the machine sits almost unused, whether disk, CPU, or RAM. And this happens with both the “default install” and “two container” installs.

With different machines the issue is the same, regardless of whether the machines are of the same type or one is “CPU-Optimized” and the other “Disk-Optimized”, etc. On top of that you add an extra layer of possible failure in the connection between the two machines, which will inevitably introduce some lag; how much depends on how you set up that connection and how far apart the machines are, but you will see the same behavior.

As a note, this type of behavior also happens with things like the Babel plugin; however, it seems to me that Babel can handle a lot more “simultaneous” writes, even though its “chats” are actually hidden topics, because the difference is in how they are presented and “refreshed”/“pulled”. That difference in behavior has led me to conclude there is some application-level correlation, a front-end kind of issue “crashing” the app (front end not being my area of expertise, contrary to back end) when people post and stay on a topic waiting for it to self-update with tens of messages in a single minute.

To that you also have to add the human factor: when people feel the site is “sluggish” or that a topic “isn’t updating as fast as it should be”, they will F5 the hell out of it, adding more load. But good luck “educating” anyone on that front :stuck_out_tongue:

1 Like