We recently been having CPU overload.
We’ve managed to pinpoint it to one query that keeps running simultaneously for hours (see below).
The task running is postmaster - once all are killed CPU is reduced until it is rebuilt every five minutes.
SELECT “users”.“id”, ARRAY_AGG(ARRAY [uccm.id, c_msg.id]) AS memberships_with_unread_messages
FROM “users”
INNER JOIN “user_options” ON “user_options”.“user_id” = “users”.“id”
INNER JOIN user_chat_channel_memberships uccm ON uccm.user_id = users.id
INNER JOIN chat_channels cc ON cc.id = uccm.chat_channel_id
INNER JOIN chat_messages c_msg ON c_msg.chat_channel_id = uccm.chat_channel_id
LEFT OUTER JOIN chat_mentions c_mentions ON c_mentions.chat_message_id = c_msg.id
WHERE “user_options”.“chat_enabled” = TRUE
AND “user_options”.“chat_email_frequency” = 1
AND (users.last_seen_at < ‘2023 - 11 - 15 09:54:26.931723’)
AND (c_msg.deleted_at IS NULL AND c_msg.user_id <> users.id)
AND (c_msg.created_at > ‘2023 - 11 - 08 10:09:26.931848’)
AND ((uccm.last_read_message_id IS NULL OR c_msg.id > uccm.last_read_message_id) AND
(uccm.last_unread_mention_when_emailed_id IS NULL OR c_msg.id > uccm.last_unread_mention_when_emailed_id) AND
(uccm.user_id = c_mentions.user_id AND uccm.following IS true AND
cc.chatable_type = ‘Category’
Can anyone help or suggest a solution?
More info:
running 3.2.0.beta4-dev, but this has been happening for at least a few months
I worked on another similar query and that fix was local to that query, so it probably won’t help here. Still, it’s worth updating to latest, there can be other optimizations.
We had a client turn on @all mentions in chat, with two channels of 16k users. That was fun!
After 19 mentions to all they quickly turned it off again. 12 hours later their (beefy) system was still at high CPU load.
The problem was resolved by clearing the 320k offending records from the mentions table
delete from chat_mentions
where chat_message_id in (
(select chat_message_id
from chat_mentions
group by chat_message_id having count(*) > 100)
);
but we were still puzzled why this was causing so much issues.
There were a number of instances of this query still running.
Because the rows were already gone, an explain did not give good results, but it is clear that it was doing some heavy grinding. All indexes are in place.
We’ve also seen chat messages disappear for up to a few minutes before reappearing.
Generally I think there’s a lot to improve in the chat, though I totally like the direction it’s going in. My main concern is how a chat can essentially cause an overload that won’t stop. We turned off the ability to mention, and that didn’t work either.
Is there a way to stop the query I mentioned from coming back? @andrei@JammyDodger
Hey @sok777 , thanks for following up — we’ve got a few engineers looking into this, and we’ll check in with you here once we have an update for you. Appreciate your patience in the meantime!
We had one message with @all which we deleted. However this has not solved the issue. The query keeps coming back every 5 minutes and is taking forever to a point that there’s a queue of many other queries.
Once we disabled the chat, the CPU load went down to 15%. Crazy
The only reason that query can take long is because of the many entries in the chat_mentions table. Have you tried the generic query I posted? Maybe it’s not @all but another large group that is mentioned.
The problematic query scans for all mentions from chats in the past 7 days.
You should also kill all the existing instances of the query.
Yes we killed then but they keep coming back.
We checked all mentions above 100 and we didn’t have any. Not sure why else it would happen.
Sounds to me like there’s a fundamental issue with the query but I could be wrong.
@sok777 I’m going to need a bit of your assistance with this. I’d like to take a look at the query plan Postrges uses on your site. Could you please run this against your database and share the results:
EXPLAIN VERBOSE SELECT "users"."id", ARRAY_AGG(ARRAY [uccm.id, c_msg.id]) AS memberships_with_unread_messages
FROM "users"
INNER JOIN "user_options" ON "user_options"."user_id" = "users"."id"
INNER JOIN "group_users" ON "group_users"."user_id" = "users"."id"
INNER JOIN "groups" ON "groups"."id" = "group_users"."group_id"
INNER JOIN user_chat_channel_memberships uccm ON uccm.user_id = users.id
INNER JOIN chat_channels cc ON cc.id = uccm.chat_channel_id
INNER JOIN chat_messages c_msg ON c_msg.chat_channel_id = uccm.chat_channel_id
LEFT OUTER JOIN chat_mentions c_mentions ON c_mentions.chat_message_id = c_msg.id
WHERE "user_options"."chat_enabled" = TRUE
AND "user_options"."chat_email_frequency" = 1
AND (users.last_seen_at < '2023-11-27 08:00:00.0000000')
AND (c_msg.deleted_at IS NULL AND c_msg.user_id <> users.id)
AND (c_msg.created_at > '2023-11-20 08:00:00.0000000')
AND ((uccm.last_read_message_id IS NULL OR c_msg.id > uccm.last_read_message_id) AND
(uccm.last_unread_mention_when_emailed_id IS NULL OR c_msg.id > uccm.last_unread_mention_when_emailed_id) AND
(
(uccm.user_id = c_mentions.user_id AND uccm.following IS true AND cc.chatable_type = 'Category') OR
(cc.chatable_type = 'DirectMessage')
)
)
GROUP BY users.id, uccm.user_id
ORDER BY "users"."id" ASC
LIMIT 1000;