PostgreSQL version upgrade: slow Sidekiq + postmaster using 95%+ CPU (32 cores)

So I did the upgrade today, and it looks like the PostgreSQL version got updated in the process. Since then this problem (or a similar one) is back (the old topic is now closed):

I tried what fixed it last time:

cd /var/discourse/
./launcher enter app
sudo -u postgres psql
\c discourse
VACUUM FULL VERBOSE;

And:

REINDEX DATABASE discourse;

But it had no effect :upside_down_face:

Postmaster is using 95-100% of all 32 CPU cores. Sidekiq is processing roughly 1 job per second.

I also tried flushing Redis and rebuilding. I tested with 12, 8, and 4 Unicorn Sidekiq processes and Unicorn workers.

Before the PostgreSQL upgrade it ran rock solid with 12 Unicorn workers and Sidekiq processes. Currently running 2.5.0.beta5; I believe it was on 2.5.0.beta2 before the upgrade.

:firstworldproblem:

1 Like

You need to check what queries PostgreSQL is actually running so we can offer advice.

Enabling pg_stat_statements is very useful for that.
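
For reference, a minimal sketch of enabling and querying pg_stat_statements (the extension ships with PostgreSQL; the exact place to edit postgresql.conf depends on how your container is set up):

```sql
-- In postgresql.conf (requires a server restart):
--   shared_preload_libraries = 'pg_stat_statements'

-- Then, inside the discourse database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top queries by total execution time (column names as of PostgreSQL 12;
-- on 13+ they are total_exec_time / mean_exec_time):
SELECT query, total_time, calls, mean_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 20;
```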

3 Likes

Thanks @Falco - here are some stats; let me know if you need to see the truncated queries.

                                                                                                                                                     query                                                                                                                                                     | total_time | calls |   min   |  mean   |    max    | pct_cpu
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+-------+---------+---------+-----------+---------
SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like | 4018799.92 |  2031 | 1458.34 | 1978.73 |   2690.51 |   38.90
SELECT COUNT(*) FROM "posts" WHERE ("posts"."deleted_at" IS NULL) AND (posts.post_type IN ($1,$2,$3)) AND (post_number > COALESCE(( SELECT last_read_post_number FROM topic_users tu WHERE tu.user_id = $4 AND tu.topic_id = $5 ),$6)) AND (reply_to_user_id = $7 OR exists( SELECT $8 from topic_users tu W | 3959915.35 |  2021 | 1467.53 | 1959.38 |   2753.25 |   38.33
SELECT COUNT(*) FROM "posts" INNER JOIN "topics" ON "topics"."id" = "posts"."topic_id" AND ("topics"."deleted_at" IS NULL) WHERE ("posts"."deleted_at" IS NULL) AND (topics.category_id = $1) AND (topics.visible = $2) AND (posts.deleted_at IS NULL) AND (posts.user_deleted = $3) AND (topics.id <> $4) A |  349840.54 |   148 |   55.14 | 2363.79 | 111182.89 |    3.39
SELECT "posts"."id" FROM "posts" WHERE ("posts"."deleted_at" IS NULL) AND "posts"."topic_id" = $1 AND (posts.user_id = $2 OR post_type IN ($3,$4,$5)) AND "posts"."post_number" = $6 LIMIT $7                                                                                                                |  237109.72 |   123 | 1488.37 | 1927.72 |   2387.02 |    2.30
SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like |  223843.91 |   560 |  269.46 |  399.72 |    522.30 |    2.17
SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "posts"."cooked", "posts"."created_at", "posts"."updated_at", "posts"."reply_to_post_number", "posts"."reply_count", "posts"."quote_count", "posts"."deleted_at", "posts"."off_topic_count", "posts"."like |  222330.42 |   113 | 1516.53 | 1967.53 |   2620.96 |    2.15
SELECT l.post_id, l.url, l.clicks, COALESCE(t.title, l.title) AS title, l.internal, l.reflection, l.domain FROM topic_links l LEFT JOIN topics t ON t.id = l.link_topic_id LEFT JOIN categories AS c ON c.id = t.category_id WHERE (t.deleted_at IS NULL) AND (COALESCE(t.archetype, $1) <> $2) AND (l.post_ |   77597.13 |   134 |  479.18 |  579.08 |    662.30 |    0.75
SELECT "posts"."id", EXTRACT($1 FROM CURRENT_TIMESTAMP - created_at)::INT AS days_ago FROM "posts" WHERE ("posts"."deleted_at" IS NULL) AND "posts"."topic_id" = $2 AND (posts.user_id = $3 OR post_type IN ($4,$5,$6)) ORDER BY "posts"."sort_order" ASC                                                    |   57100.18 |   131 |  285.36 |  435.88 |    545.55 |    0.55
SELECT MAX("posts"."post_number") FROM "posts" WHERE ("posts"."deleted_at" IS NULL) AND "posts"."topic_id" = $1 AND (posts.user_id = $2 OR post_type IN ($3,$4,$5))                                                                                                                                          |   56887.08 |   131 |  285.22 |  434.25 |    534.65 |    0.55
SELECT COALESCE(SUM(rs.score), $1) AS total_spam_score, COUNT(DISTINCT rs.user_id) AS spam_user_count FROM reviewables AS r INNER JOIN reviewable_scores AS rs ON rs.reviewable_id = r.id WHERE r.target_created_by_id = $2 AND rs.reviewable_score_type = $3 AND rs.status IN ($4, $5)                      |   16211.73 |   119 |   86.25 |  136.23 |    163.52 |    0.16
SELECT COUNT(*) FROM "topics" WHERE ("topics"."deleted_at" IS NULL) AND "topics"."category_id" = $1 AND "topics"."visible" = $2 AND (topics.id <> $3) AND (topics.created_at > $4)                                                                                                                           |   14922.91 |   152 |   61.68 |   98.18 |    154.13 |    0.14
SELECT COUNT(*) FROM "users" INNER JOIN "topic_allowed_users" ON "users"."id" = "topic_allowed_users"."user_id" WHERE "topic_allowed_users"."topic_id" = $1                                                                                                                                                  |   14358.68 |   186 |    9.17 |   77.20 |    109.96 |    0.14
SELECT ftl.url, COALESCE(ft.title, ftl.title) AS title, ftl.link_topic_id, ftl.reflection, ftl.internal, ftl.domain, MIN(ftl.user_id) AS user_id, SUM(clicks) AS clicks FROM topic_links AS ftl LEFT JOIN topics AS ft ON ftl.link_topic_id = ft.id LEFT JOIN categories AS c ON c.id = ft.category_id WHERE |   14029.70 |   186 |   12.82 |   75.43 |     94.01 |    0.14
SELECT COUNT(*) FROM ( SELECT $1 FROM notifications n LEFT JOIN topics t ON t.id = n.topic_id WHERE t.deleted_at IS NULL AND n.high_priority = $2 AND n.user_id = $3 AND n.id > $4 AND NOT read LIMIT $5 ) AS X                                                                                              |    3364.46 |  3790 |    0.02 |    0.89 |      3.21 |    0.03
SELECT * FROM ( SELECT n.id, n.read FROM notifications n LEFT JOIN topics t ON n.topic_id = t.id WHERE t.deleted_at IS NULL AND n.high_priority AND n.user_id = $1 AND NOT read ORDER BY n.id DESC LIMIT $2 ) AS x UNION ALL SELECT * FROM ( SELECT n.id, n.read FROM notifications n LEFT JOIN topics t ON  |     968.26 |  3654 |    0.06 |    0.26 |      1.04 |    0.01
SELECT tags.name as tag_name, SUM(stats.topic_count) AS sum_topic_count FROM category_tag_stats stats JOIN tags ON stats.tag_id = tags.id AND stats.topic_count > $1 WHERE stats.category_id in ($2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26,$27,$28,$29,$30 |     953.21 |   173 |    3.57 |    5.51 |      7.52 |    0.01
SELECT a.attname FROM ( SELECT indrelid, indkey, generate_subscripts(indkey, $1) idx FROM pg_index WHERE indrelid = $2::regclass AND indisprimary ) i JOIN pg_attribute a ON a.attrelid = i.indrelid AND a.attnum = i.indkey[i.idx] ORDER BY i.idx                                                           |     657.76 |   163 |    0.01 |    4.04 |      8.66 |    0.01
SELECT $1 AS one FROM "push_subscriptions" WHERE "push_subscriptions"."user_id" = $2 LIMIT $3                                                                                                                                                                                                                |     381.25 |  2029 |    0.10 |    0.19 |      0.41 |    0.00
INSERT INTO "notifications" ("notification_type", "user_id", "data", "created_at", "updated_at", "topic_id", "post_number") VALUES ($1, $2, $3, $4, $5, $6, $7) RETURNING "id"                                                                                                                               |     339.56 |  2028 |    0.08 |    0.17 |      0.94 |    0.00
SELECT "users".* FROM "users" WHERE "users"."id" = $1 LIMIT $2                                                                                                                                                                                                                                               |     313.51 |  7521 |    0.02 |    0.04 |      0.16 |    0.00
2 Likes

Just chiming in to say I'm hitting the same problem - though unfortunately I'm not familiar with PostgreSQL, so I'd need more detailed instructions to dig into it. Sorry.

3 Likes

Can you try running VACUUM VERBOSE ANALYZE? @eboehnisch this applies to you too. This absurdly simple query averaging 2 seconds makes no sense:

As mentioned in the PostgreSQL 12 update topic, generating statistics after the update is a good idea.
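
For context, pg_upgrade does not carry planner statistics over to the new cluster, so the planner runs blind until ANALYZE regenerates them. A minimal sketch of doing and verifying that:

```sql
-- Regenerate planner statistics for every table in the current database.
-- VERBOSE only prints progress; ANALYZE is the part that matters.
VACUUM VERBOSE ANALYZE;

-- Check when each table's statistics were last refreshed:
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY last_analyze DESC NULLS LAST;
```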

4 Likes

Tried running VACUUM VERBOSE ANALYZE;. What should I be looking for in the output?

1 Like

Excellent, thanks! I think that fixed the issue of the postmaster processes eating all the CPU. That's a huge improvement.

Unfortunately Sidekiq is still slow, but better than before running VACUUM VERBOSE ANALYZE;.

2 Likes

Nothing in particular, unless there's a big error message.

Performance should be back to normal now.

Awesome!

Now it should be able to win the race and finally work through the queue.

3 Likes

That did improve things, but after a while there were again three postmaster processes each pegged at 100% CPU for over a minute.

1 Like

I'd recommend waiting a bit, since you may have built up a job backlog during the slow period. Check /sidekiq for the stats.

1 Like

:smiley: It's really picking up speed now, so it looks fixed. Thank you so much for your help @Falco! I'll let it run for a while and report back here later.

2 Likes

Maybe try running:

VACUUM FULL VERBOSE;

and then (assuming your database name is 'discourse', which is the default):

REINDEX DATABASE discourse;

This may be better than a regular reindex, but I haven't tested it: PostgreSQL 12 update

Both of the operations listed above take locks, which can be very tricky on a live site. I'd recommend the ones listed in post-update-optional-tasks, since aside from using a lot of CPU they won't block normal database operation.
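
The non-blocking variant being referred to is, roughly, the CONCURRENTLY form of reindexing (available since PostgreSQL 12) - a sketch, assuming the default public schema:

```sql
-- Rebuilds every index in the schema without taking locks that block
-- normal reads and writes; it does more total work and uses more CPU,
-- but the site stays usable while it runs.
REINDEX SCHEMA CONCURRENTLY public;
```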

2 Likes

Ran both. The latter reported an error, though:

discourse=# REINDEX SCHEMA CONCURRENTLY public;
ERROR:  could not create unique index "index_users_on_username_lower_ccnew"
DETAIL:  Key (username_lower)=(marks) is duplicated.

I guess it's time to dive into PostgreSQL and find and delete that record.
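
A hedged sketch of how one might locate the offending rows (the table and column names are taken from the error message above; inspect the results carefully before deleting anything):

```sql
-- List all users that share the same username_lower value:
SELECT id, username, username_lower, created_at
FROM users
WHERE username_lower IN (
  SELECT username_lower
  FROM users
  GROUP BY username_lower
  HAVING COUNT(*) > 1
)
ORDER BY username_lower, created_at;
```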

The vacuuming mostly worked, but I still get spikes every few minutes where 1 to 3 postmaster processes peg a CPU for a minute or two. So I guess the issue isn't fully resolved.

1 Like

Is it possible to roll back to PostgreSQL 10?

1 Like

Sure, if you have a backup from before the upgrade. However, everything created since that backup would be lost.

It's almost certainly better to sort these problems out, since Postgres 12 is now the standard for any new install.

1 Like

No, that's not possible… we're moving to PG12 for good.

1 Like

See:

2 Likes

OK, following up: everything still looks fine, the postmaster load has dropped dramatically, and Sidekiq is running really fast again.

To summarize what worked in my case (with the default database name 'discourse'):

cd /var/discourse/
./launcher enter app
sudo -u postgres psql
\c discourse

Then run the following commands in sequence in the postgres console. How long each takes depends on the database size, and the first two will also cause heavy CPU usage:

VACUUM FULL VERBOSE;

REINDEX DATABASE discourse;

VACUUM VERBOSE ANALYZE;

Note: I didn't notice a clear change until I ran VACUUM VERBOSE ANALYZE; as suggested by @Falco, so the first two commands may not have been necessary. On the previous PostgreSQL version, however, the first two seemed to be what fixed the issue here

If you hit an error like 'ERROR: deadlock detected' while running REINDEX DATABASE discourse;, just retry until it succeeds. That happened to me last time (on the previous PostgreSQL version) too.

There was a suggestion to do a concurrent reindex instead of the reindex above: PostgreSQL 12 update

But note that @eboehnisch above hit an error during the concurrent reindex, see here

8 Likes

We've run into this repeatedly across migrations; I think we should add it to the migration script, @Falco…

I know it would make the process noticeably slower, but it's worth it to reduce the support load afterwards.

Note… the word VERBOSE in those only means that information is printed as they run; the key piece that was actually missing is the VACUUM ANALYZE. The reindexing is mainly about reclaiming space.
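
To summarize the commands used in this thread and what each one actually does (my own gloss, not new advice from the thread):

```sql
VACUUM ANALYZE;              -- refreshes planner statistics; the fix that mattered here
VACUUM FULL;                 -- rewrites tables to reclaim disk space; takes heavy locks
REINDEX DATABASE discourse;  -- rebuilds indexes, shedding index bloat; also takes locks
```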

6 Likes