private-message-topic-tracking-state.json 加载缓慢

blattersturm · 2021 年9 月 2 日 19:16

自 2.8.0.beta5 版本起，加载群组私信收件箱需要 4 秒以上，其中大部分时间（2-3 秒）消耗在 private-message-topic-tracking-state.json 的大型 SQL 查询上。

我们约有 80 万个主题、380 万用户，并且可能存在一个庞大的未读私信与用户关联图，但在查询分析中我并未发现太多值得注意的问题（可能是我不太擅长解读）。

在 beta5 之前一切正常，现在私信收件箱变得非常难以追踪，令人烦恼。

Falco · 2021 年9 月 2 日 19:41

嗯，我们在 Meta 上的平均耗时约为 200 毫秒。

不过，现在访问个人资料中的消息会发起 3 次独立的请求，这确实有点可疑。而且你提到的那个请求返回了一个包含 900 个对象的 92kB JSON 文件，这让我很震惊

blattersturm · 2021 年9 月 2 日 19:52

这似乎确实会在中途使 miniprofiler 的状态失效，导致我甚至无法获取 SQL 查询以正确地进行 EXPLAIN 分析。

Falco · 2021 年9 月 2 日 20:13

是的，那条路由转换表现不太正常。

顺便提一下，我提取了一个执行计划：ieWq | explain.depesz.com

tgxworld · 2021 年9 月 3 日 05:52

此修复应能将其降至约 13 毫秒。

请注意，未读消息越多，查询的开销就越大。

@sam 我还需要修复另一件事：标记未读为已读时，不会更新我们为普通主题和私信主题设置的 first_unread_at 优化。

blattersturm · 2021 年9 月 3 日 06:28

在 cherry-picking 此更改并重启 unicorn 后的 1.73 秒，但我和我的组的“未读时间”可能滞后得相当多（我当前邮箱仅显示 59 条新消息和 4 条未读——这个路由是按用户划分的吗？不过即使对于“所有消息”，我也只有 61 条新消息和 7 条未读）。

tgxworld · 2021 年9 月 3 日 06:33

未读时间是否与您的收件箱中的未读消息相关联？

blattersturm · 2021 年9 月 3 日 08:01

discourse=# select first_unread_pm_at from group_users where user_id = 2334 order by first_unread_pm_at asc;
     first_unread_pm_at
----------------------------
 2021-09-03 06:54:13.493514
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
(17 rows)

discourse=# select first_unread_pm_at from user_stats where user_id = 2334;
     first_unread_pm_at
----------------------------
 2021-09-03 07:05:00.565522
(1 row)

奇怪的是，这个时间范围甚至不算特别大，所以应该是其他条件导致了性能缓慢（不过，神秘的是，在忽略几条未读消息后，查询速度确实提升到了 700 毫秒，这还可以接受。我不确定我之前的 first_unread_pm_at 范围是多少，也许是几周？）。

有没有办法让我在消息页面刷新并清除所有旧请求后，从 miniprofiler 中获取查询语句，以便我运行 EXPLAIN？

tgxworld · 2021 年9 月 3 日 08:21

这确实很奇怪，但我无法在开发环境中复现此问题。下周我会花时间进一步调查。

tgxworld · 2021 年9 月 7 日 05:37

已在以下链接中修复：

blattersturm · 2021 年9 月 7 日 22:08

谢谢！我会尽量记住在下一次重建/重启窗口时再次检查。

blattersturm · 2021 年9 月 8 日 04:47

拿到了查询计划/EXPLAIN ANALYZE 结果（最高达 1.3 秒：confused）：CvUX | explain.depesz.com

在执行 reindex index concurrently index_topics_on_timestamps_private; 后，速度有所提升 (768ms)。这可能是因为之前大量的私信（DM）记录所致（在清理之前，叙事机器人严重污染了私信表）……不过在该索引上的查询速度仍然较慢。

tgxworld · 2021 年9 月 8 日 05:52

你能在 Rails 控制台中为我运行 Topic.where(archetype: Archetype.private_message).count 吗？

顺便提一下，该查询现在是异步执行的，因此不会阻塞消息页面或私信的加载。

blattersturm · 2021 年9 月 8 日 10:10

[1] pry(main)> Topic.where(archetype: Archetype.private_message).count
=> 543855

情况可能更糟，我们最近已经清理了所有遗留的叙事机器人主题。不过还是比我预期的要多！

啊，难怪那个查询要过一会儿才在 miniprofiler 中显示。这很好！

forkythetoy · 2021 年9 月 8 日 19:13

我们最近再次更新了以获取此功能 https://github.com/discourse/discourse/pull/13603（即当前版本为 2.8.0beta6）。

看起来“获取用户所有私信”的功能，最终会指向 FEATURE: New and Unread messages for user personal messages. by tgxworld · Pull Request #13603 · discourse/discourse · GitHub

SELECT "topics"."id", "topics"."title", "topics"."last_posted_at", "topics"."created_at", "topics"."updated_at", "topics"."views", "topics"."posts_count", "topics"."user_id", "topics"."last_post_user_id", "topics"."reply_count", "topics"."featured_user1_id", "topics"."featured_user2_id", "topics"."featured_user3_id", "topics"."deleted_at", "topics"."highest_post_number", "topics"."like_count", "topics"."incoming_link_count", "topics"."category_id", "topics"."visible", "topics"."moderator_posts_count", "topics"."closed", "topics"."archived", "topics"."bumped_at", "topics"."has_summary", "topics"."archetype", "topics"."featured_user4_id", "topics"."notify_moderators_count", "topics"."spam_count", "topics"."pinned_at", "topics"."score", "topics"."percent_rank", "topics"."subtype", "topics"."slug", "topics"."deleted_by_id", "topics"."participant_count", "topics"."word_count", "topics"."excerpt", "topics"."pinned_globally", "topics"."pinned_until", "topics"."fancy_title", "topics"."highest_staff_post_number", "topics"."featured_link", "topics"."reviewable_score", "topics"."image_upload_id", "topics"."slow_mode_seconds" FROM "topics" LEFT OUTER JOIN topic_users AS tu ON (topics.id = tu.topic_id AND tu.user_id = 1234) LEFT JOIN group_archived_messages gm ON gm.topic_id = topics.id
LEFT JOIN user_archived_messages um
ON um.user_id = 1234
AND um.topic_id = topics.id WHERE "topics"."deleted_at" IS NULL AND (topics.id IN (
SELECT topic_id
FROM topic_allowed_users
WHERE user_id = 1234
 UNION ALL
SELECT topic_id FROM topic_allowed_groups
 WHERE group_id IN (
SELECT group_id FROM group_users WHERE user_id = 1234
  )
)) AND "topics"."archetype" = 'private_message' AND "topics"."visible" = TRUE AND (um.user_id IS NULL AND gm.topic_id IS NULL) ORDER BY topics.bumped_at DESC LIMIT 30;

该查询基本导致我们的 RDS 服务器挂起。我们团队的一位成员建议，在连接之前先利用现有索引过滤数据。优化后的查询运行时间小于 100 毫秒。

SELECT "topics_filter"."id", "topics_filter"."title", "topics_filter"."last_posted_at", "topics_filter"."created_at", "topics_filter"."updated_at",
"topics_filter"."views", "topics_filter"."posts_count", "topics_filter"."user_id", "topics_filter"."last_post_user_id", "topics_filter"."reply_count", "topics_filter"."featured_user1_id",
"topics_filter"."featured_user2_id", "topics_filter"."featured_user3_id", "topics_filter"."deleted_at", "topics_filter"."highest_post_number", "topics_filter"."like_count",
"topics_filter"."incoming_link_count", "topics_filter"."category_id", "topics_filter"."visible", "topics_filter"."moderator_posts_count", "topics_filter"."closed",
"topics_filter"."archived", "topics_filter"."bumped_at", "topics_filter"."has_summary", "topics_filter"."archetype", "topics_filter"."featured_user4_id", "topics_filter"."notify_moderators_count",
"topics_filter"."spam_count", "topics_filter"."pinned_at", "topics_filter"."score", "topics_filter"."percent_rank", "topics_filter"."subtype", "topics_filter"."slug",
"topics_filter"."deleted_by_id", "topics_filter"."participant_count", "topics_filter"."word_count", "topics_filter"."excerpt", "topics_filter"."pinned_globally",
"topics_filter"."pinned_until", "topics_filter"."fancy_title", "topics_filter"."highest_staff_post_number", "topics_filter"."featured_link", "topics_filter"."reviewable_score",
"topics_filter"."image_upload_id", "topics_filter"."slow_mode_seconds"
FROM (select * from "topics"
      where "topics"."deleted_at" IS NULL
      and (topics.id IN ( SELECT topic_id FROM topic_allowed_users WHERE user_id = 1234
                          UNION ALL
                          SELECT topic_id FROM topic_allowed_groups  WHERE group_id IN ( SELECT group_id FROM group_users WHERE user_id = 1234 )
                         ))
      AND "topics"."archetype" = 'private_message'
      AND "topics"."visible" = TRUE
      )   as topics_filter
LEFT OUTER JOIN (select topic_id from topic_users where user_id = 1234) AS tu ON (topics_filter.id = tu.topic_id )
LEFT JOIN group_archived_messages gm ON gm.topic_id = topics_filter.id
LEFT JOIN ( select user_id, topic_id from user_archived_messages where user_id = 1234) um
ON um.topic_id = topics_filter.id
WHERE (um.user_id IS NULL AND gm.topic_id IS NULL) ORDER BY topics_filter.bumped_at DESC LIMIT 30;

供参考：topics 表超过 1000 万行，topic_users 表超过 5000 万行。

是否有可能重构该查询以某种方式解决这些问题？

编辑：

澄清一下，当我们检查运行时间超过 2 分钟甚至 5 分钟的长查询时，出现的只有这些查询。在应用最新更改后，我们的 Postgres CPU 持续处于 99% 的使用率。当我们终止这些查询时，CPU 使用率会恢复正常。

forkythetoy · 2021 年9 月 8 日 19:19

@tgxworld 很抱歉直接提及您，但既然您是作者，我想请教您是否有任何建议或反馈。

codinghorror · 2021 年9 月 8 日 19:58

据我所知，@sam 一直将其描述为一项高风险且危险的功能。如果这里出现重大回归，我并不会感到意外。

Hooksmith · 2021 年9 月 8 日 20:01

我们在发布此内容后发现了这个关于 bug 的讨论帖：Slow loading on private-message-topic-tracking-state.json - #9 by tgxworld

看起来在标记为 v2.8.0beta6 的提交之后，可能还有一个提交似乎解决了部分性能问题。我们打算尝试针对该特定提交引用重新构建，看看是否能缓解任何问题。

Falco · 2021 年9 月 8 日 21:04

您当前的 RDS 实例类型和数据库总大小是多少？

forkythetoy · 2021 年9 月 8 日 21:34

db.m5.2xlarge

磁盘使用量约为 250GB。

话题		回复	浏览量
App/models/topic_tracking_state - `report` - long response times? Development slow-sql	6	1810	2016 年2 月 26 日
Performance improvements on long topics? Support	30	5596	2018 年8 月 20 日
Slow SQL query causes homepage to load in 2-4 sec Support	16	1700	2018 年7 月 1 日
Private Message Page Performance Development slow-sql	5	1822	2017 年9 月 13 日
Slow Profile Loads with 100GB+ database Feature	43	3052	2021 年1 月 1 日

private-message-topic-tracking-state.json 加载缓慢

相关话题