Slow Sidekiq + Postmaster using 95%+ CPU (32 cores) after Postgresql Version Upgrade

Can you try a vacuum verbose analyze? Also applies to you @eboehnisch. This following very simple query averaging 2s is really weird:

As said in the PostgreSQL 12 update topic, generating statistics is a good idea after the update.

4 Likes

Tried VACUUM VERBOSE ANALYZE;. What shall I be looking for in the results?

1 Like

Great, thanks! So I think this has solved the issue with postmasters going wild using all the CPU. So this is a big improvement.

Sidekiq is still slow unfortunately, but better than before running VACUUM VERBOSE ANALYZE;

2 Likes

Nothing special, unless it is a big error message.

Performance should recover now.

YAY!

It should be able to win the race and process the queue eventually now.

3 Likes

This improved the situation definitely but after a while again three postmaster processes use 100% CPU for over a minute each.

1 Like

I’d say wait a bit, because you probably have a queue to chew from the slow period. Check /sidekiq for stats.

1 Like

:smiley: Speed has really picked up now, looks like it’s fixed. Thanks a lot for the help @Falco! I’ll leave it running for a while and post an update here later~

2 Likes

Maybe try running:

VACUUM FULL VERBOSE;

and (assuming your db name is ‘discourse’ which is default):

REINDEX DATABASE discourse;

This might be better than normal reindex, but I haven’t tested it: PostgreSQL 12 update

Both operations listed above generates locks that can be really hard in a live site. I’d recommend the ones listed at post-update-optional-tasks because other than using lots of CPU, they don’t block the normal database operations.

2 Likes

Run both. The latter reports an error, however:

discourse=# REINDEX SCHEMA CONCURRENTLY public;
ERROR:  could not create unique index "index_users_on_username_lower_ccnew"
DETAIL:  Key (username_lower)=(marks) is duplicated.

I guess it’s time to get into PostgreSQL and to find and delete the record.

The vacuuming did the trick, more or less, but I still get peaks every few minutes with one to three postmaster processes consuming CPU for one or two minutes each. So I guess there’s more.

1 Like

Is it possible to revert to PostgreSQL 10?

1 Like

Sure, if you have a backup from before the upgrade. You’re going to lose everything created since the backup though.

It’s almost certainly better to get to the bottom of these problems, Postgres12 is now standard for any new install.

1 Like

No, that is not possible… we are permanently moving to PG12.

1 Like

see:

2 Likes

Ok so to follow up, everything still looks good, postmaster has calmed down drastically and sidekiq is moving really fast again.

To summarise on what worked in my scenario (with default db name of ‘discourse’):

cd /var/discourse/
./launcher enter app
sudo -u postgres psql
\c discourse

Then from the postgres console, each of these below were run. Each one takes a bit of time to complete, depending on DB size, heavy CPU usage for the first two also:

VACUUM FULL VERBOSE;

REINDEX DATABASE discourse;

VACUUM VERBOSE ANALYZE;

Note: I didn’t notice a difference until VACUUM VERBOSE ANALYZE; was ran as per @Falco’s recommendation, so the first two might not have been necessary. Though, the first two seemed to be key to fixing this issue in the past on the previous version of Postgresql.

If you get an error like ‘ERROR: deadlock detected’ during REINDEX DATABASE discourse; - just try running it again until it works. This happened to me last time (previous Postgresql version).

There are some recommendations to run a concurrent reindex instead of the reindex above: PostgreSQL 12 update

However, note that @eboehnisch above got an error from the concurrent reindex, see above.

7 Likes

We have seen this happen enough in various moves that I feel we ought to add it to the migration script @Falco

I know is slows stuff down a fair bit, but it is worth it, will cut down on support.

Note… the VERBOSE word there just means that it outputs stuff as it is going … the big thing missing was the VACUUM ANALYZE reindex is mostly about reclaiming space.

6 Likes

Thank you, everyone here! I fixed the duplicate key, dropped the obsolete indexes, and then ran

REINDEX DATABASE discourse;
VACUUM VERBOSE ANALYZE;

successfully. CPU usage has returned back to normal. Phew.

Question: Why does a non-optimized database or a broken index lead to such high CPU usage by the postmaster? Just curious.

4 Likes

I think the broken index is a red herring, it is certainly not great, and should be fixed

The giant issue is that this move from 10-12 leaves the db with a terrible statistics story, this leads to bad performance .

Perf is bad cause the query optimiser chooses very bad plans for queries, cause the statistics it has about data in the tables is totally off

We will integrate the rebuilding of statistics via vacuum into our automated move

7 Likes

Thank you, @sam, for the explanations. That makes sense. I guess it’s a good idea to integrate the rebuilding into the automated move. Thank you, again, for you help!

2 Likes

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.