Endlessly running Postgres processes & bad performance after Reinstall/restore

In an effort to update our forum I did a fresh VPS install + restore over the weekend.

This was going to solve multiple issues for us:

  • Renew the outdated Ubuntu install
  • Update Discourse
  • Upgrade to Postgres 15

While things generally went fine, afterwards I saw Postgres processes going wild and pinning a core each at 100% CPU, with the number of processes varying. I tried a few things, from a rebuild to restarts. I’m currently trying a rake db:validate_indexes, which has already been running for a few hours without any feedback. I’m not sure whether I’ve run this before and whether it’s supposed to finish quicker.

Using the forum basically works, but it has definitely slowed down. Longer-running tasks, like pulling up the profiles of more active users, take notably longer than usual.

I’m quite sure there are some issues with the database but I’m having a hard time finding out which ones.

I gotta say our database is quite huge - we’re at around 150 GB after the restore and index creation. From monitoring the restore process I did not see any errors, and index creation looked fine to my eyes.

Any idea how to tackle this? It’s 3 postgres processes right now - it was 6 before a restart I did a few hours ago - and right after the restore I’ve even seen all 16 cores in use.

EDIT: I just noticed that 3 sidekiq processes are busy with “indexing categories for search”. Could this all just be the search index rebuilding? If so, can it be handled differently? When we do the live-system restore, this will be a huge problem if it degrades performance like that for multiple hours or even days.

Right now only one sidekiq task is running (“Jobs::BackfillBadge”), but 7 postgres processes are still constantly pegged at 100% CPU. Really curious what’s going on there.
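In case it helps with diagnosing: a query against pg_stat_activity should show what those backends are actually running (a minimal sketch; it assumes the standard Discourse docker layout with psql opened as the postgres user inside the app container):

    -- from the host: ./launcher enter app, then: sudo -u postgres psql discourse
    SELECT pid,
           state,
           now() - query_start AS runtime,
           left(query, 100)    AS current_query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY runtime DESC NULLS LAST;

If the long runners turn out to be search-index updates, that would at least confirm the sidekiq suspicion.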

After moves like this it is a good idea to run a vacuum so the database statistics get refreshed.
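Concretely that is a single command in psql against the Discourse database (a sketch; “discourse” as the database name assumes a stock install):

    -- ANALYZE refreshes the planner statistics that a restore leaves stale;
    -- plain VACUUM also tidies up dead rows, and both run online
    VACUUM VERBOSE ANALYZE;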


How much RAM and CPU do you have?

How much memory do you give Postgres?

That test server has 32 GB of RAM and 16 cores; work_mem is set to 64 MB.

EDIT: shared_buffers is at 8 GB.
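For reference, the values Postgres is actually using can be checked from psql (a sketch; in a stock Discourse docker install these normally come from db_shared_buffers and db_work_mem in containers/app.yml, which is an assumption about this setup):

    SHOW shared_buffers;  -- should report 8GB here
    SHOW work_mem;        -- should report 64MB here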

Currently doing a vacuum, which looks stuck.

Not sure if it’s doing anything, but it’s been sitting there for 30+ minutes already.
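Before killing it I can at least check pg_stat_progress_vacuum to tell “slow but moving” from “actually stuck” (a sketch; the view exists on Postgres 9.6 and later and only covers plain VACUUM, not VACUUM FULL):

    -- one row per backend currently running VACUUM, with its phase and block counters
    SELECT p.pid,
           p.relid::regclass AS table_name,
           p.phase,
           p.heap_blks_scanned,
           p.heap_blks_total
    FROM pg_stat_progress_vacuum p;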

I put the forum into read-only mode and rebooted the VM to end the 7 Postgres processes that were “stuck” before. Shortly after the restart, 2 of these postgres processes were back and didn’t go away. Nothing is running in sidekiq right now.

You really don’t want to run a VACUUM FULL. All you need to get performance back is a VACUUM VERBOSE ANALYZE. You can’t run a FULL on a running site.
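The locking behaviour is the reason (a rough sketch of the difference, not specific to Discourse):

    -- plain VACUUM (optionally with ANALYZE) takes only a SHARE UPDATE EXCLUSIVE lock,
    -- so normal reads and writes keep working while it runs
    VACUUM VERBOSE ANALYZE;
    -- VACUUM FULL rewrites every table under an ACCESS EXCLUSIVE lock,
    -- blocking all queries on that table until it finishes
    -- VACUUM FULL;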


I’m no expert at huge databases, but I’d make the buffers two or three times that.

I’m sure you have indexes that are 8 GB.

:thinking: Postgres recommends never setting shared_buffers over 40% of the machine’s memory?

That said, your server might be underdimensioned.


Aha! Sensible advice from an expert! So maybe I was right that 8 GB/25% isn’t enough, and though 16 GB is more than 40%, it might still be a good suggestion because . . . .


Guys, as said, this is a test server - there is no traffic on it. This server definitely isn’t good enough for production use, but that’s not the issue here. The question is why we see postgres processes sticking around like that (at 100% CPU usage) and slowing things down drastically. We were running the test server with even lower capacity until a few days ago - it was only upsized because it lacked the disk space for a restore.

The production machine is running with 128 GB of RAM and the same shared_buffers setting without any issues - so I don’t think there’s a general problem with these settings and the shared buffer size - especially not on a private test machine without traffic.

But anyway - I will just redo the restore and see whether something went wrong, as there’s seemingly no good explanation for the behaviour.