Slow Sidekiq + postmaster using 95%+ CPU (32 cores) after PostgreSQL version upgrade

So I did an upgrade today and the PostgreSQL version seems to have been updated in the process. Since then, this issue (or a similar one) has returned (the old topic is closed now):

I tried what solved this issue before:

cd /var/discourse/
./launcher enter app
sudo -u postgres psql
\c discourse
VACUUM FULL VERBOSE;

and also:

REINDEX DATABASE discourse;

No luck :upside_down_face:

Postmaster is using 95%-100% of 32 CPU cores. Sidekiq moving at ~1 job per second.

I also tried flushing Redis and rebuilding again, and I have tested with 12, 8, and 4 unicorn workers and sidekiqs.

It was working really solid and smooth before the PostgreSQL upgrade with 12 unicorn workers and sidekiqs. Currently running 2.5.0.beta5, and I believe I was running 2.5.0.beta2 before the update.

:firstworldproblem:

1 Like

You need to check exactly which queries PostgreSQL is running before we can make suggestions.

Enabling pg_stat_statements is really useful for that.
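
As a sketch (assuming superuser access inside the app container), the extension can be enabled and queried like this. Column names here match PostgreSQL 12; `total_time` was split into `total_exec_time` in later versions:

```sql
-- Requires pg_stat_statements in shared_preload_libraries
-- (postgresql.conf) and a restart if not already loaded:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 20 queries by cumulative execution time:
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 20;
```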

3 Likes

Thanks @Falco - here are some stats, let me know if you need to see more of the queries as they get cut off at the end.

                                                                                                                                                     query                                                                                                                                                     | total_time | calls |   min   |  mean   |    max    | pct_cpu
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+-------+---------+---------+-----------+---------
SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like | 4018799.92 |  2031 | 1458.34 | 1978.73 |   2690.51 |   38.90
SELECT COUNT(*) FROM "posts" WHERE ("posts"."deleted_at" IS NULL) AND (posts.post_type IN ($1,$2,$3)) AND (post_number > COALESCE(( SELECT last_read_post_number FROM topic_users tu WHERE tu.user_id = $4 AND tu.topic_id = $5 ),$6)) AND (reply_to_user_id = $7 OR exists( SELECT $8 from topic_users tu W | 3959915.35 |  2021 | 1467.53 | 1959.38 |   2753.25 |   38.33
SELECT COUNT(*) FROM "posts" INNER JOIN "topics" ON "topics"."id" = "posts"."topic_id" AND ("topics"."deleted_at" IS NULL) WHERE ("posts"."deleted_at" IS NULL) AND (topics.category_id = $1) AND (topics.visible = $2) AND (posts.deleted_at IS NULL) AND (posts.user_deleted = $3) AND (topics.id <> $4) A |  349840.54 |   148 |   55.14 | 2363.79 | 111182.89 |    3.39
SELECT "posts"."id" FROM "posts" WHERE ("posts"."deleted_at" IS NULL) AND "posts"."topic_id" = $1 AND (posts.user_id = $2 OR post_type IN ($3,$4,$5)) AND "posts"."post_number" = $6 LIMIT $7                                                                                                                |  237109.72 |   123 | 1488.37 | 1927.72 |   2387.02 |    2.30
SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like |  223843.91 |   560 |  269.46 |  399.72 |    522.30 |    2.17
SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like |  222330.42 |   113 | 1516.53 | 1967.53 |   2620.96 |    2.15
SELECT l.post_id, l.url, l.clicks, COALESCE(t.title, l.title) AS title, l.internal, l.reflection, l.domain FROM topic_links l LEFT JOIN topics t ON t.id = l.link_topic_id LEFT JOIN categories AS c ON c.id = t.category_id WHERE (t.deleted_at IS NULL) AND (COALESCE(t.archetype, $1) <> $2) AND (l.post_ |   77597.13 |   134 |  479.18 |  579.08 |    662.30 |    0.75
SELECT "posts"."id", EXTRACT($1 FROM CURRENT_TIMESTAMP - created_at)::INT AS days_ago FROM "posts" WHERE ("posts"."deleted_at" IS NULL) AND "posts"."topic_id" = $2 AND (posts.user_id = $3 OR post_type IN ($4,$5,$6)) ORDER BY "posts"."sort_order" ASC                                                    |   57100.18 |   131 |  285.36 |  435.88 |    545.55 |    0.55
SELECT MAX("posts"."post_number") FROM "posts" WHERE ("posts"."deleted_at" IS NULL) AND "posts"."topic_id" = $1 AND (posts.user_id = $2 OR post_type IN ($3,$4,$5))                                                                                                                                          |   56887.08 |   131 |  285.22 |  434.25 |    534.65 |    0.55
SELECT COALESCE(SUM(rs.score), $1) AS total_spam_score, COUNT(DISTINCT rs.user_id) AS spam_user_count FROM reviewables AS r INNER JOIN reviewable_scores AS rs ON rs.reviewable_id = r.id WHERE r.target_created_by_id = $2 AND rs.reviewable_score_type = $3 AND rs.status IN ($4, $5)                      |   16211.73 |   119 |   86.25 |  136.23 |    163.52 |    0.16
SELECT COUNT(*) FROM "topics" WHERE ("topics"."deleted_at" IS NULL) AND "topics"."category_id" = $1 AND "topics"."visible" = $2 AND (topics.id <> $3) AND (topics.created_at > $4)                                                                                                                           |   14922.91 |   152 |   61.68 |   98.18 |    154.13 |    0.14
SELECT COUNT(*) FROM "users" INNER JOIN "topic_allowed_users" ON "users"."id" = "topic_allowed_users"."user_id" WHERE "topic_allowed_users"."topic_id" = $1                                                                                                                                                  |   14358.68 |   186 |    9.17 |   77.20 |    109.96 |    0.14
SELECT ftl.url, COALESCE(ft.title, ftl.title) AS title, ftl.link_topic_id, ftl.reflection, ftl.internal, ftl.domain, MIN(ftl.user_id) AS user_id, SUM(clicks) AS clicks FROM topic_links AS ftl LEFT JOIN topics AS ft ON ftl.link_topic_id = ft.id LEFT JOIN categories AS c ON c.id = ft.category_id WHERE |   14029.70 |   186 |   12.82 |   75.43 |     94.01 |    0.14
SELECT COUNT(*) FROM ( SELECT $1 FROM notifications n LEFT JOIN topics t ON t.id = n.topic_id WHERE t.deleted_at IS NULL AND n.high_priority = $2 AND n.user_id = $3 AND n.id > $4 AND NOT read LIMIT $5 ) AS X                                                                                              |    3364.46 |  3790 |    0.02 |    0.89 |      3.21 |    0.03
SELECT * FROM ( SELECT n.id, n.read FROM notifications n LEFT JOIN topics t ON n.topic_id = t.id WHERE t.deleted_at IS NULL AND n.high_priority AND n.user_id = $1 AND NOT read ORDER BY n.id DESC LIMIT $2 ) AS x UNION ALL SELECT * FROM ( SELECT n.id, n.read FROM notifications n LEFT JOIN topics t ON  |     968.26 |  3654 |    0.06 |    0.26 |      1.04 |    0.01
SELECT tags.name as tag_name, SUM(stats.topic_count) AS sum_topic_count FROM category_tag_stats stats JOIN tags ON stats.tag_id = tags.id AND stats.topic_count > $1 WHERE stats.category_id in ($2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26,$27,$28,$29,$30 |     953.21 |   173 |    3.57 |    5.51 |      7.52 |    0.01
SELECT a.attname FROM ( SELECT indrelid, indkey, generate_subscripts(indkey, $1) idx FROM pg_index WHERE indrelid = $2::regclass AND indisprimary ) i JOIN pg_attribute a ON a.attrelid = i.indrelid AND a.attnum = i.indkey[i.idx] ORDER BY i.idx                                                           |     657.76 |   163 |    0.01 |    4.04 |      8.66 |    0.01
SELECT $1 AS one FROM "push_subscriptions" WHERE "push_subscriptions"."user_id" = $2 LIMIT $3                                                                                                                                                                                                                |     381.25 |  2029 |    0.10 |    0.19 |      0.41 |    0.00
INSERT INTO "notifications" ("notification_type", "user_id", "data", "created_at", "updated_at", "topic_id", "post_number") VALUES ($1, $2, $3, $4, $5, $6, $7) RETURNING "id"                                                                                                                               |     339.56 |  2028 |    0.08 |    0.17 |      0.94 |    0.00
SELECT "users".* FROM "users" WHERE "users"."id" = $1 LIMIT $2                                                                                                                                                                                                                                               |     313.51 |  7521 |    0.02 |    0.04 |      0.16 |    0.00
2 Likes

Just a quick note that I have the same problem — and no expertise in PostgreSQL, I’m afraid. If I should look somewhere, I’ll need a bit more detailed instructions. Sorry.

3 Likes

Can you try a VACUUM VERBOSE ANALYZE? This also applies to you, @eboehnisch. The following very simple query averaging 2s is really weird:

As said in the PostgreSQL 12 update topic, generating statistics is a good idea after the update.
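
A minimal sketch of the statistics step, run from psql while connected to the Discourse database. Note that a plain ANALYZE is enough to regenerate planner statistics; VACUUM ANALYZE additionally cleans up dead tuples, and neither takes the heavy locks that VACUUM FULL does:

```sql
-- Reclaim dead tuples and regenerate planner statistics:
VACUUM VERBOSE ANALYZE;

-- Or, statistics only:
ANALYZE VERBOSE;
```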

4 Likes

Tried VACUUM VERBOSE ANALYZE;. What shall I be looking for in the results?

1 Like

Great, thanks! So I think this has solved the issue with postmasters going wild using all the CPU. So this is a big improvement.

Sidekiq is still slow unfortunately, but better than before running VACUUM VERBOSE ANALYZE;

2 Likes

Nothing special, unless it is a big error message.

Performance should recover now.

YAY!

It should be able to win the race and process the queue eventually now.

3 Likes

This definitely improved the situation, but after a while three postmaster processes are again using 100% CPU for over a minute each.

1 Like

I’d say wait a bit, because you probably have a queue to chew through from the slow period. Check /sidekiq for stats.

1 Like

:smiley: Speed has really picked up now, looks like it’s fixed. Thanks a lot for the help @Falco! I’ll leave it running for a while and post an update here later~

2 Likes

Maybe try running:

VACUUM FULL VERBOSE;

and (assuming your db name is ‘discourse’ which is default):

REINDEX DATABASE discourse;

This might be better than normal reindex, but I haven’t tested it: PostgreSQL 12 update

Both operations listed above generate locks that can be really harsh on a live site. I’d recommend the ones listed at post-update-optional-tasks because, other than using lots of CPU, they don’t block normal database operations.
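
For reference, PostgreSQL 12 added a CONCURRENTLY option to REINDEX, which rebuilds indexes without blocking normal reads and writes, at the cost of more total work. A sketch, assuming the default `discourse` database:

```sql
-- Non-blocking rebuild of all indexes in the public schema
-- (PostgreSQL 12+); cannot be run inside a transaction block:
REINDEX SCHEMA CONCURRENTLY public;
```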

2 Likes

I ran both. The latter reports an error, however:

discourse=# REINDEX SCHEMA CONCURRENTLY public;
ERROR:  could not create unique index "index_users_on_username_lower_ccnew"
DETAIL:  Key (username_lower)=(marks) is duplicated.

I guess it’s time to get into PostgreSQL and to find and delete the record.
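
For anyone in the same spot, a sketch of a query to locate the offending rows (deciding which duplicate to delete or rename is a manual judgment call):

```sql
-- List all users sharing a username_lower value:
SELECT id, username, username_lower
FROM users
WHERE username_lower IN (
  SELECT username_lower FROM users
  GROUP BY username_lower HAVING COUNT(*) > 1
);
```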

The vacuuming did the trick, more or less, but I still get peaks every few minutes, with one to three postmaster processes consuming CPU for one or two minutes each. So I guess there’s more to it.

1 Like

Is it possible to revert to PostgreSQL 10?

1 Like

Sure, if you have a backup from before the upgrade. You’re going to lose everything created since the backup, though.

It’s almost certainly better to get to the bottom of these problems; PostgreSQL 12 is now standard for any new install.

1 Like

No, that is not possible… we are permanently moving to PG12.

1 Like

see:

2 Likes

Ok so to follow up, everything still looks good, postmaster has calmed down drastically and sidekiq is moving really fast again.

To summarise on what worked in my scenario (with default db name of ‘discourse’):

cd /var/discourse/
./launcher enter app
sudo -u postgres psql
\c discourse

Then, from the postgres console, each of the commands below was run. Each one takes a while to complete depending on DB size, with heavy CPU usage for the first two:

VACUUM FULL VERBOSE;

REINDEX DATABASE discourse;

VACUUM VERBOSE ANALYZE;

Note: I didn’t notice a difference until VACUUM VERBOSE ANALYZE; was run as per @Falco’s recommendation, so the first two might not have been necessary. Though, the first two seemed to be key to fixing this issue in the past on the previous version of PostgreSQL.

If you get an error like ‘ERROR: deadlock detected’ during REINDEX DATABASE discourse;, just try running it again until it works. This happened to me last time (on the previous PostgreSQL version).
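
The retry-until-it-works step could be scripted from outside the psql console. This is a hypothetical sketch (run from inside the app container, assuming the default `discourse` database name):

```shell
# Re-run the reindex until it succeeds; a transient deadlock
# usually just needs another attempt.
until sudo -u postgres psql -d discourse -c 'REINDEX DATABASE discourse;'; do
  echo 'REINDEX failed (possibly a deadlock), retrying in 30s...'
  sleep 30
done
```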

There are some recommendations to run a concurrent reindex instead of the reindex above: PostgreSQL 12 update

However, note that @eboehnisch got an error from the concurrent reindex (see above).

7 Likes

We have seen this happen enough in various moves that I feel we ought to add it to the migration script, @Falco.

I know it slows stuff down a fair bit, but it is worth it and will cut down on support.

Note… the VERBOSE keyword there just means that it outputs progress as it goes. The big thing missing was the VACUUM ANALYZE; reindexing is mostly about reclaiming space.

6 Likes