Improving Instance Performance (Megatopics, Database Size and Extreme Load)

Agree, what is the value of content that nobody reads? And which people are gonna actually sit down and read 10,000 posts from start to finish? :crazy_face:

It’s OK for some topics to be ephemeral chats which disappear entirely over time, what we’ve previously called “fast lane” versus “slow lane” traffic.

5 Likes

Small Update:

Since we reported this and agreed on closing everything above the mark (and re-establishing the limits), performance seems to be better, a lot better. We still get some “Due to extreme load, this is temporarily being shown to everyone as a logged out user would see it” messages, but considerably fewer.

With that said:

  • Thanks for the hard work looking into how to reduce that monster table, really appreciate it.
  • Are there any other “Performance Tips” or even “Setups” that we may be missing? Even if they are “advanced”, any help is appreciated.

Again, a huge thanks to everyone for helping and giving feedback.

3 Likes

Postgres 12 will help as it reduces table and index size by 20% in @falco’s testing.

4 Likes

Is there a target date for that yet?

1 Like

I started doing benchmarks today.

Gotta let it run for a while here on Meta so we can watch for performance regressions before rolling it out everywhere.

6 Likes

I’m sorry for bringing this one up again, but this is a different issue from the PostgreSQL migration (which I’m not able to do yet due to size).

Since the last rebuild my database size won’t stop increasing:

[image: database size graph]

The last rebuild was on May 17th, and since then the DB size won’t stop increasing, now reaching 57.3 GB, with most of that in the post_timings table.

My main issue with this is that I’m trying to do the PostgreSQL update (which will reduce index size by 20% but won’t solve this in the long run). From the Staff comments here, that size is not the norm, so it will keep increasing and become nightmarishly expensive. The more time passes, the more it grows, creating a loop that becomes hell to maintain. So my main question stands: is there a way to tackle this post_timings thing? Something to delete?

Can I compact tables or something?
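Not an answer on its own, but to double-check where the space is actually going (table data versus index bloat), a standard Postgres query works; the query below is a generic sketch, assuming you can reach the `discourse` database from a psql shell inside the app container:

```sql
-- Run inside the app container's Postgres (e.g. ./launcher enter app,
-- then: su postgres -c 'psql discourse'); read-only, safe on a live site.
SELECT relname                                       AS table_name,
       pg_size_pretty(pg_relation_size(relid))       AS table_data,
       pg_size_pretty(pg_indexes_size(relid))        AS indexes,
       pg_size_pretty(pg_total_relation_size(relid)) AS total
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;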

Thanks to everyone for their help.

1 Like

There is no way around it at this time. If your forum is really big, it will have a big timings table.

Meta is a really old instance, but it is mid-sized and has a small 4 GB post_timings table. On the other hand, we host one instance that is less than 2 years old whose post_timings table should be over 100 GB by now.

Hosting big forums will cost you more, and there are no ways around it today.

Maybe move your database to a standalone $20 droplet (80 GB disk) and put the web app on another $10 droplet? $30 in monthly costs for what appears to be quite a big community sounds reasonable.
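To see why the table scales this way: post_timings stores roughly one row per (user, post) pair actually read, so growth tracks readership, not just post count. A back-of-envelope estimate, where the per-row byte cost is an assumed illustrative figure (real numbers depend on your schema, indexes, and bloat):

```python
def estimate_post_timings_bytes(readers_per_post, total_posts, bytes_per_row=100):
    """Rough post_timings size estimate: one row per (user, post) read.

    bytes_per_row is an assumption covering tuple overhead plus index
    entries; measure your own instance for real figures.
    """
    rows = readers_per_post * total_posts
    return rows * bytes_per_row

# e.g. 500 readers per post across 1,000,000 posts at ~100 bytes/row
size = estimate_post_timings_bytes(500, 1_000_000)
print(f"{size / 10**9:.1f} GB")  # → 50.0 GB
```

The point of the arithmetic: an active forum where each post is read by hundreds of users accumulates hundreds of millions of rows regardless of how the servers are arranged.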

4 Likes

Thank you very much for your help, @Falco.

Welp, yeah, as I said, I was just asking because there’s no magic when it comes to space. I’m looking into splitting the setup, but then the app server may be too slow perf-wise (long story; I’m taking notes on the tests and will post them here later for everyone else to use if they find them useful).

I did the test I asked you about, restoring from a backup, and I think this can be a good situation to leverage: what I can immediately see is about 30% less disk usage (I’m still running tests to check that nothing is missing). There is a small issue with this approach, though, so even if the immediate benefits are great, it will generate some problems, mostly because I don’t know whether stored images are merely cached or not working at all (and yes, the backup includes them).

I’m taking notes so that I can update the OP; my idea is to add a small series of notes for people who worry about performance, covering all the things I’ve been modifying so far.

Is applying an “auto-delete replies” timer a good solution to this, technically speaking?

1 Like

Actually this is a pretty good idea and solves the usability problem (because, as we said, no one is going to read 10K messages). So the big question is whether that would be taxing on the server and database.

3 Likes

I’m not sure whether a 9000-reply topic with ~8600 of the replies deleted is good for perf, but somehow I doubt it. What say you, @eviltrout?

1 Like

I thought that “hidden” posts were completely removed from the database after some time, but now I see that is probably not the case.
So performance-wise my idea can’t solve the problem.

Is there any way to “purge” this data?

Deleted posts are soft deleted. However, they are often indexed, so you might notice some improvement when a bunch are removed. I wouldn’t recommend it, though. If there’s any way to move to a new topic once one gets large, that might serve you well.

3 Likes

What do you mean by this? Having the database and the web app on separate servers should give you more performance: while there will be some latency between servers, your Unicorn and Postmaster won’t have to compete for CPU and RAM.

Make sure the servers are in the same DO region :stuck_out_tongue:

3 Likes

Sorry, you are right, that would free them both and get better performance than running everything on a single one (I was comparing against the resources I’m using on a single machine and the tweaks I’ve been doing so far).

That is actually a very good idea; let me try to solve the “unable to rebuild the data container” issue I have, and that will be my next jump in this journey.

1 Like

I have been searching my head off around this topic, but could not find documentation on how to do this properly. Does such a guide exist?

We are also starting to hit a wall with our standard single-VPS installation. Our rather unique dilemma is the game chats, which take place during hockey games and cause sharp spikes in activity/load, especially if something extraordinary happens in the game.

You would need something powerful enough to withstand your busiest moments, I guess. Or you would need to increase performance during those times. Maybe look for a VPS you can pay for per hour. One solution (continuing the previous tip) would be to move the web container to an extremely powerful VPS that you pay for just for a few hours when there are games.

You need to:

  1. Run PostgreSQL elsewhere (on a droplet, or use a hosted service like https://www.digitalocean.com/products/managed-databases/), and move your data there.

  2. Follow Running Discourse with a separate PostgreSQL server
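For step 2, the change boils down to pointing the web container at the external database. A sketch of the relevant `env` section of `containers/web_only.yml`; the host, name, and credentials below are placeholders, not real values, so check the linked guide for the authoritative list of settings:

```yaml
env:
  DISCOURSE_DB_HOST: 10.0.0.2        # placeholder: private IP of the DB droplet
  DISCOURSE_DB_PORT: 5432
  DISCOURSE_DB_NAME: discourse
  DISCOURSE_DB_USERNAME: discourse
  DISCOURSE_DB_PASSWORD: change-me   # placeholder credential
```

Using the droplets’ private network addresses (rather than public IPs) keeps latency down and avoids exposing Postgres to the internet.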

2 Likes

And this can also be achieved using Discourse’s containerized setup? web_only and data containers, right?

From my experience, this is not directly solved by any current approach, nor does it have a linear solution. In fact, separating them onto different machines is not an instant fix for that issue.

We also experience heavy drops and “the site is extremely busy so you are seeing it as someone that isn’t logged in” messages when a big event happens (such as a game, like @ljpp said), and that drags down the whole site, not only the people inside that topic.

So, I tried two different things, a separated setup and a “big machine”, and both have this type of issue. My instances are monitored with Prometheus, with the logs visible in Grafana, etc., so I have very granular insight into hardware/container performance, and I can confirm that it really doesn’t matter what you do: the issue happens anyway.

If you put a big machine behind it you may delay it a little, but you will still get the errors and session drops while the machine sits almost unused, whether disk, CPU, or RAM. And this happens with both the “default install” and “two container” installs.

With different machines the issue is the same, regardless of whether the machines are of the same type or one is “CPU-Optimized” and the other “Disk-Optimized”, etc. On top of that you add an extra layer of possible failure in the connection between the two machines, which will inevitably introduce some lag; how much depends on how you set up that connection and how far apart the machines are, but you will see the same behavior.

As a note, this type of behavior also happens with things like the Babel plugin; however, it seems to me that Babel can handle a lot more “simultaneous” writes, even though its “chats” are actually hidden topics, because the difference is in how they are presented and “refreshed”/“pulled”. That difference in behavior has led me to conclude there is some application-level correlation, a front-end kind of issue “crashing” the app (front end not being my area of expertise, contrary to back end) when people post and stay on a topic waiting for it to self-update with tens of messages in a single minute.

To that you also have to add the human factor: when people feel the site is “sluggish” or that a topic “isn’t updating as fast as it should be”, they will F5 the hell out of it, adding more load. But good luck “educating” anyone on that front :stuck_out_tongue:

1 Like