Slow loading on private-message-topic-tracking-state.json

Since 2.8.0.beta5, loading a group PM inbox takes 4+ seconds, with most of that time (2-3 seconds) spent on the huge SQL query behind private-message-topic-tracking-state.json.

We have ~800k topics, ~3.8M users, and probably a large graph of unread-PMs-to-users, but I’m also not really seeing much of note in the query analysis (I’m probably just bad at reading it).
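
For a rough sense of how big that graph actually is, something like the following gives approximate row counts (using the table names that show up in the slow query later in this thread; reltuples is only the planner's estimate, so it is instant even on large tables):

-- Approximate row counts for the PM permission/tracking tables.
SELECT relname, reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname IN ('topic_allowed_users', 'topic_allowed_groups', 'topic_users');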

This was fine before beta5; now PM inboxes are very difficult and annoying to keep track of.

1 Like

Hmmm, this is averaging 200ms here on Meta for us.

That said, it’s quite sus that going to messages in your profile now makes 3 distinct fetches, and that the one you mention is returning a 92kB JSON with 900 objects for me :scream:

2 Likes

It also seems to invalidate miniprofiler’s state halfway through, so I can’t even grab the SQL query to properly EXPLAIN it.

2 Likes

Yeah, that route transition is not acting normally.

By the way, I extracted a plan here: ieWq | explain.depesz.com
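
(For reference, a plan like that can be produced with a plain EXPLAIN (ANALYZE, BUFFERS) and pasted into explain.depesz.com; the sketch below uses one of the sub-queries from further down this thread as a stand-in for the full tracking-state query.)

-- Sketch: swap the body for the query copied out of the Rails log / miniprofiler.
EXPLAIN (ANALYZE, BUFFERS)
SELECT topic_id
FROM topic_allowed_users
WHERE user_id = 1234;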

3 Likes

This fix should bring it down to 13ms or so.

Do note that the more unread you have, the more expensive the query is going to be.

@sam Another thing I need to fix is that dismissing unread doesn’t update the first_unread_at optimisations we have for normal and PM topics.
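
(For context, those optimisations are per-user and per-group lower-bound timestamps that the tracking query filters on. The sketch below is purely illustrative of the "dismiss all" case, not the actual fix, using the first_unread_pm_at columns that show up in the console output further down and user_id 1234 as a placeholder.)

-- Illustrative only: after "dismiss all unread", nothing older than now()
-- can still be unread, so both lower bounds may safely jump forward.
UPDATE user_stats  SET first_unread_pm_at = NOW() WHERE user_id = 1234;
UPDATE group_users SET first_unread_pm_at = NOW() WHERE user_id = 1234;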

4 Likes

1.73 seconds after cherry-picking this change and restarting unicorn, but the unread-at time for myself and my groups is probably quite a bit behind. (I’ve only got 59 new and 4 unread visible for this inbox - is this route per user? Though even for ‘all messages’ I’ve got 61 new and 7 unread.)

1 Like

Does the unread-at time correlate to the unreads in your inboxes?

1 Like
discourse=# select first_unread_pm_at from group_users where user_id = 2334 order by first_unread_pm_at asc;
     first_unread_pm_at
----------------------------
 2021-09-03 06:54:13.493514
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
 2021-09-03 07:05:33.138986
(17 rows)
discourse=# select first_unread_pm_at from user_stats where user_id = 2334;
     first_unread_pm_at
----------------------------
 2021-09-03 07:05:00.565522
(1 row)

Curiously, the range isn’t even that wide, so some other condition is being slow. (Though, mysteriously, it did speed up to 700ms after dismissing a few unreads, which is bearable; I’m not sure what my first_unread_pm_at range was before, maybe a few weeks?)

Is there any way for me to get the query out of miniprofiler so I can run an EXPLAIN? On the messages page it refreshes itself and removes all the old requests.
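
(One workaround, assuming the pg_stat_statements extension is enabled on the database, is to pull the normalized query text straight out of Postgres instead of miniprofiler; the column names below are the Postgres 13+ ones.)

-- Slowest statements by mean execution time; the tracking-state query
-- should show up near the top together with its full (normalized) text.
SELECT calls,
       round(mean_exec_time)  AS mean_ms,
       round(total_exec_time) AS total_ms,
       left(query, 120)       AS query_start
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;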

This is indeed quite odd, but I can’t reproduce it in the development environment. I’ll spend some time looking into this next week.

1 Like

Fixed it in

1 Like

Thanks! I’ll try to remember to check again next rebuild/restart window.

1 Like

Got a query plan/EXPLAIN ANALYZE result (back up to 1.3s :confused:): CvUX | explain.depesz.com

After running REINDEX INDEX CONCURRENTLY index_topics_on_timestamps_private; it got a bit faster (768ms), so this might just be an artifact of a lot of former DMs (before the cleanup, the narrative bot was polluting the DM table a lot!)… though it’s still slow on that same index.
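
(If it helps to compare before/after the REINDEX, the index size is easy to check; this is plain Postgres, nothing Discourse-specific.)

-- Size of the index the slow plan keeps landing on.
SELECT pg_size_pretty(pg_relation_size('index_topics_on_timestamps_private'));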

Are you able to run Topic.where(archetype: Archetype.private_message).count for me in a Rails console?

By the way, the query is now run asynchronously, so it will not block the loading of the messages page or a private message.

[1] pry(main)> Topic.where(archetype: Archetype.private_message).count
=> 543855

Could be a lot worse; we did prune all the leftover narrative bot topics recently. :sweat_smile: Still more than I expected!

Ah, I was wondering why that query now only shows up in miniprofiler after a while. That’s good!

We recently updated again to pick up this feature: FEATURE: New and Unread messages for user personal messages. by tgxworld · Pull Request #13603 · discourse/discourse · GitHub (i.e. we’re currently on 2.8.0.beta6).

It seems like the “get all PMs for a user” path, which eventually leads to somewhere in FEATURE: New and Unread messages for user personal messages. by tgxworld · Pull Request #13603 · discourse/discourse · GitHub, results in a query like this:

SELECT "topics"."id", "topics"."title", "topics"."last_posted_at", "topics"."created_at", "topics"."updated_at", "topics"."views", "topics"."posts_count", "topics"."user_id", "topics"."last_post_user_id", "topics"."reply_count", "topics"."featured_user1_id", "topics"."featured_user2_id", "topics"."featured_user3_id", "topics"."deleted_at", "topics"."highest_post_number", "topics"."like_count", "topics"."incoming_link_count", "topics"."category_id", "topics"."visible", "topics"."moderator_posts_count", "topics"."closed", "topics"."archived", "topics"."bumped_at", "topics"."has_summary", "topics"."archetype", "topics"."featured_user4_id", "topics"."notify_moderators_count", "topics"."spam_count", "topics"."pinned_at", "topics"."score", "topics"."percent_rank", "topics"."subtype", "topics"."slug", "topics"."deleted_by_id", "topics"."participant_count", "topics"."word_count", "topics"."excerpt", "topics"."pinned_globally", "topics"."pinned_until", "topics"."fancy_title", "topics"."highest_staff_post_number", "topics"."featured_link", "topics"."reviewable_score", "topics"."image_upload_id", "topics"."slow_mode_seconds" FROM "topics" LEFT OUTER JOIN topic_users AS tu ON (topics.id = tu.topic_id AND tu.user_id = 1234) LEFT JOIN group_archived_messages gm ON gm.topic_id = topics.id
LEFT JOIN user_archived_messages um
ON um.user_id = 1234
AND um.topic_id = topics.id WHERE "topics"."deleted_at" IS NULL AND (topics.id IN (
SELECT topic_id
FROM topic_allowed_users
WHERE user_id = 1234
 UNION ALL
SELECT topic_id FROM topic_allowed_groups
 WHERE group_id IN (
SELECT group_id FROM group_users WHERE user_id = 1234
  )
)) AND "topics"."archetype" = 'private_message' AND "topics"."visible" = TRUE AND (um.user_id IS NULL AND gm.topic_id IS NULL) ORDER BY topics.bumped_at DESC LIMIT 30;

This basically hangs our RDS server. A member of our team suggested that we filter the data down using whatever indexes we have before joining; the rewritten query below runs in < 100ms.

SELECT "topics_filter"."id", "topics_filter"."title", "topics_filter"."last_posted_at", "topics_filter"."created_at", "topics_filter"."updated_at",
"topics_filter"."views", "topics_filter"."posts_count", "topics_filter"."user_id", "topics_filter"."last_post_user_id", "topics_filter"."reply_count", "topics_filter"."featured_user1_id",
"topics_filter"."featured_user2_id", "topics_filter"."featured_user3_id", "topics_filter"."deleted_at", "topics_filter"."highest_post_number", "topics_filter"."like_count",
"topics_filter"."incoming_link_count", "topics_filter"."category_id", "topics_filter"."visible", "topics_filter"."moderator_posts_count", "topics_filter"."closed",
"topics_filter"."archived", "topics_filter"."bumped_at", "topics_filter"."has_summary", "topics_filter"."archetype", "topics_filter"."featured_user4_id", "topics_filter"."notify_moderators_count",
"topics_filter"."spam_count", "topics_filter"."pinned_at", "topics_filter"."score", "topics_filter"."percent_rank", "topics_filter"."subtype", "topics_filter"."slug",
"topics_filter"."deleted_by_id", "topics_filter"."participant_count", "topics_filter"."word_count", "topics_filter"."excerpt", "topics_filter"."pinned_globally",
"topics_filter"."pinned_until", "topics_filter"."fancy_title", "topics_filter"."highest_staff_post_number", "topics_filter"."featured_link", "topics_filter"."reviewable_score",
"topics_filter"."image_upload_id", "topics_filter"."slow_mode_seconds"
FROM (select * from "topics"
      where "topics"."deleted_at" IS NULL
      and (topics.id IN ( SELECT topic_id FROM topic_allowed_users WHERE user_id = 1234
                          UNION ALL
                          SELECT topic_id FROM topic_allowed_groups  WHERE group_id IN ( SELECT group_id FROM group_users WHERE user_id = 1234 )
                         ))
      AND "topics"."archetype" = 'private_message'
      AND "topics"."visible" = TRUE
      )   as topics_filter
LEFT OUTER JOIN (select topic_id from topic_users where user_id = 1234) AS tu ON (topics_filter.id = tu.topic_id )
LEFT JOIN group_archived_messages gm ON gm.topic_id = topics_filter.id
LEFT JOIN ( select user_id, topic_id from user_archived_messages where user_id = 1234) um
ON um.topic_id = topics_filter.id
WHERE (um.user_id IS NULL AND gm.topic_id IS NULL) ORDER BY topics_filter.bumped_at DESC LIMIT 30;

For reference, topics has > 10M rows and topic_users > 50M rows.
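
(For scale, the on-disk footprint of those two tables can be pulled straight from pg_class, indexes and TOAST included.)

SELECT c.relname,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
WHERE c.relname IN ('topics', 'topic_users');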

Is it possible to restructure the query to address these issues somehow?

Edit:

Just to clarify: when we inspect our longer-running queries (> 2 min, or even > 5 min), these are the only queries that show up. With the newest change, our Postgres CPU is constantly at 99%; when I terminate these queries, the CPU usage returns to normal.
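
(For reference, one way to spot and stop these using only standard Postgres views and functions: list the long-running statements via pg_stat_activity, then call pg_cancel_backend(pid) or pg_terminate_backend(pid) on the offenders.)

-- Active queries that have been running for more than 2 minutes, oldest first.
SELECT pid,
       now() - query_start AS runtime,
       left(query, 120)    AS query_text
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '2 minutes'
ORDER BY query_start;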

3 Likes

@tgxworld sorry to tag you directly, but since you were the author I was wondering if you had any suggestions/feedback.

1 Like

As I recall, @sam always described it as a risky and dangerous feature. Big regressions here would not surprise me.

We found this bug topic after posting this: Slow loading on private-message-topic-tracking-state.json - #9 by tgxworld

It looks like there is a commit after the one tagged v2.8.0.beta6 that seems to address some of this performance problem. We’re going to try rebuilding against that specific commit reference and see if it alleviates anything.

2 Likes

What is your current RDS instance type and total database size?

1 Like

db.m5.2xlarge

~ 250GB in terms of disk usage.

1 Like