Ever since FEATURE: new watched_precedence_over_muted setting (#22252) · discourse/discourse@9cf981f · GitHub (presumably!) we’re having Sidekiq queues fill up/OOM with stuck PostAlert jobs:
Given this commit has already had a similar regression (FIX: error when CategoryList tried to find relevant topics by lis2 · Pull Request #22339 · discourse/discourse · GitHub ) it’s fairly suspect - we’ll try downgrading to a commit before the aforementioned one now though and will report back with the results.
2 Likes
Moving downward to 4f7f9ef87cbcd574144f657dd43b7e705d98ff8e
indeed solved the stuck PostAlert sidekiq workers and the OOM concerns: the few PostAlert jobs that do get queued now complete within a matter of seconds, not minutes.
2 Likes
I just saw FIX: improve performance of post alerter job (#22378) · discourse/discourse@7a204e7 · GitHub be pushed - will try this out later if this is meant to be a fix for this issue.
2 Likes
Hey, yes, this hopefully will solve this issue. I tested this solution on 3 forums, and a new query was always much faster and didn’t clog servers.
Thank you for reporting this issue, and please let me know if you still experience problems.
4 Likes
Sadly, even with these changes, we’re still seeing PostAlert jobs take way longer than before and hang the Sidekiq workers entirely while processing.
(we have 10M+ user rows and some categories muted by default, so there’s a lot of mutes set up!)
Reverting the three commits that recently touched this job and restarting the container, meanwhile, has the jobs complete fine.
3 Likes
Thank you, I will have another look
I just merged another try to improve performance of this job:
discourse:main
← lis2:improve_post_alerter_job
opened 03:25AM - 07 Jul 23 UTC
Simplified query based on SiteSettings to join only relevant user_options rows. …
In addition, index was added to 'watched_precedence_over_muted` column in `user_options` table to speed up query
Original
```
SELECT "users".* FROM "users" LEFT JOIN user_options ON user_options.user_id = users.id WHERE (id IN (SELECT user_id FROM category_users WHERE category_id = 231 AND notification_level = 0
UNION
SELECT user_id FROM tag_users tu JOIN topic_tags tt ON tt.tag_id = tu.tag_id AND tt.topic_id = 18799 AND tu.notification_level = 0
EXCEPT
SELECT user_id FROM topic_users tus WHERE tus.topic_id = 18799 AND tus.notification_level = 3
)) AND (user_options.watched_precedence_over_muted IS false OR (user_options.watched_precedence_over_muted IS NULL AND true))
Nested Loop Left Join (cost=42034.83..117264.12 rows=21226 width=356) (actual time=39.473..39.550 rows=1 loops=1)
Filter: ((user_options.watched_precedence_over_muted IS FALSE) OR (user_options.watched_precedence_over_muted IS NULL))
-> Nested Loop (cost=42034.40..95792.76 rows=42240 width=356) (actual time=39.455..39.531 rows=1 loops=1)
-> Subquery Scan on "ANY_subquery" (cost=42033.97..43307.96 rows=42240 width=4) (actual time=39.435..39.511 rows=1 loops=1)
-> HashSetOp Except (cost=42033.97..42885.56 rows=42240 width=8) (actual time=39.433..39.508 rows=1 loops=1)
-> Append (cost=42033.97..42779.90 rows=42264 width=8) (actual time=39.155..39.278 rows=2 loops=1)
-> HashAggregate (cost=42033.97..42456.37 rows=42240 width=8) (actual time=39.154..39.238 rows=1 loops=1)
Group Key: category_users.user_id
Batches: 1 Memory Usage: 1561kB
-> Append (cost=0.44..41928.37 rows=42240 width=4) (actual time=38.968..38.996 rows=1 loops=1)
-> Index Scan using idx_category_users_category_id_user_id on category_users (cost=0.44..41290.25 rows=42239 width=4) (actual time=38.967..38.969 rows=1 loops=1)
Index Cond: (category_id = 231)
Filter: (notification_level = 0)
Rows Removed by Filter: 2
-> Nested Loop (cost=0.15..4.52 rows=1 width=4) (actual time=0.025..0.025 rows=0 loops=1)
-> Seq Scan on topic_tags tt (cost=0.00..1.04 rows=2 width=4) (actual time=0.007..0.007 rows=2 loops=1)
Filter: (topic_id = 18799)
Rows Removed by Filter: 1
-> Index Only Scan using idx_tag_users_ix2 on tag_users tu (cost=0.15..1.73 rows=1 width=8) (actual time=0.007..0.007 rows=0 loops=2)
Index Cond: ((tag_id = tt.tag_id) AND (notification_level = 0))
Heap Fetches: 0
-> Subquery Scan on "*SELECT* 3" (cost=0.43..112.21 rows=24 width=8) (actual time=0.023..0.038 rows=1 loops=1)
-> Index Scan using index_topic_users_on_topic_id_and_user_id on topic_users tus (cost=0.43..111.97 rows=24 width=4) (actual time=0.022..0.036 rows=1 loops=1)
Index Cond: (topic_id = 18799)
Filter: (notification_level = 3)
Rows Removed by Filter: 13
-> Index Scan using users_pkey on users (cost=0.43..1.24 rows=1 width=356) (actual time=0.014..0.014 rows=1 loops=1)
Index Cond: (id = "ANY_subquery".user_id)
-> Index Scan using index_user_options_on_user_id_and_default_calendar on user_options (cost=0.43..0.50 rows=1 width=5) (actual time=0.013..0.013 rows=1 loops=1)
Index Cond: (user_id = users.id)
Planning Time: 0.869 ms
Execution Time: 46.053 ms
```
Tweaked
```
EXPLAIN ANALYZE SELECT "users".* FROM "users" WHERE (id IN (SELECT uo.user_id FROM user_options uo
LEFT JOIN topic_users tus ON tus.user_id = uo.user_id AND tus.topic_id = 18799
LEFT JOIN category_users cu ON cu.user_id = uo.user_id AND cu.category_id = 231
LEFT JOIN tag_users tu ON tu.user_id = uo.user_id
JOIN topic_tags tt ON tt.tag_id = tu.tag_id AND tt.topic_id = 18799
WHERE
(tus.id IS NULL OR tus.notification_level != 3)
AND (cu.notification_level = 0
OR
tu.notification_level = 0)
AND (uo.watched_precedence_over_muted IS NULL OR uo.watched_precedence_over_muted IS false)
));
Nested Loop (cost=56.73..59.06 rows=5 width=356) (actual time=0.160..0.161 rows=0 loops=1)
-> HashAggregate (cost=56.30..56.35 rows=5 width=8) (actual time=0.160..0.161 rows=0 loops=1)
Group Key: uo.user_id
Batches: 1 Memory Usage: 24kB
-> Nested Loop Left Join (cost=1.45..56.29 rows=5 width=8) (actual time=0.157..0.158 rows=0 loops=1)
Filter: ((tus.id IS NULL) OR (tus.notification_level <> 3))
-> Nested Loop Left Join (cost=1.01..50.45 rows=6 width=8) (actual time=0.157..0.158 rows=0 loops=1)
Filter: ((cu.notification_level = 0) OR (tu.notification_level = 0))
-> Nested Loop (cost=0.58..44.40 rows=7 width=12) (actual time=0.157..0.157 rows=0 loops=1)
-> Nested Loop (cost=0.15..11.73 rows=14 width=8) (actual time=0.157..0.157 rows=0 loops=1)
-> Seq Scan on topic_tags tt (cost=0.00..1.04 rows=2 width=4) (actual time=0.147..0.147 rows=2 loops=1)
Filter: (topic_id = 18799)
Rows Removed by Filter: 1
-> Index Only Scan using idx_tag_users_ix2 on tag_users tu (cost=0.15..5.28 rows=7 width=12) (actual time=0.003..0.003 rows=0 loops=2)
Index Cond: (tag_id = tt.tag_id)
Heap Fetches: 0
-> Index Scan using index_user_options_on_user_id_and_default_calendar on user_options uo (cost=0.43..2.33 rows=1 width=4) (never executed)
Index Cond: (user_id = tu.user_id)
Filter: ((watched_precedence_over_muted IS NULL) OR (watched_precedence_over_muted IS FALSE))
-> Index Scan using idx_category_users_user_id_category_id on category_users cu (cost=0.43..0.85 rows=1 width=8) (never executed)
Index Cond: ((user_id = uo.user_id) AND (category_id = 231))
-> Index Scan using index_topic_users_on_topic_id_and_user_id on topic_users tus (cost=0.43..0.96 rows=1 width=12) (never executed)
Index Cond: ((topic_id = 18799) AND (user_id = uo.user_id))
-> Index Scan using users_pkey on users (cost=0.43..0.54 rows=1 width=356) (never executed)
Index Cond: (id = uo.user_id)
Planning Time: 3.636 ms
Execution Time: 0.253 ms
```
I tested in on few instances and it was fine, but they were smaller than yours.
If it is still failing with this commit, could you provide me EXPLAIN ANALYSE
report? Script to generate it:
topic = Topic.last
user_option_sql_fragment =
if SiteSetting.watched_precedence_over_muted
<<~SQL
INTERSECT
SELECT user_id FROM user_options WHERE user_options.watched_precedence_over_muted IS false
SQL
else
<<~SQL
EXCEPT
SELECT user_id FROM user_options WHERE user_options.watched_precedence_over_muted IS true
SQL
end
user_ids_sql = <<~SQL
(
SELECT user_id FROM category_users WHERE category_id = #{topic.category_id.to_i} AND notification_level = #{CategoryUser.notification_levels[:muted]}
UNION
SELECT user_id FROM tag_users tu JOIN topic_tags tt ON tt.tag_id = tu.tag_id AND tt.topic_id = #{topic.id} WHERE tu.notification_level = #{TagUser.notification_levels[:muted]}
EXCEPT
SELECT user_id FROM topic_users tus WHERE tus.topic_id = #{topic.id} AND tus.notification_level = #{TopicUser.notification_levels[:watching]}
)
#{user_option_sql_fragment}
SQL
sql = User.where("id IN (#{user_ids_sql})").to_sql
sql_with_index = <<SQL
EXPLAIN ANALYZE #{sql};
SQL
result = ActiveRecord::Base.connection.execute("#{sql_with_index}")
puts sql_with_index
result.each do |r|
puts r.values
end
It would help me to find missing index or which part of this query is so slow.
3 Likes
This change seems to have returned PostAlert jobs to their normal duration and not hanging the Sidekiq instances. Yay!
3 Likes
Falco
(Falco)
Closed
July 17, 2023, 11:00am
11
This topic was automatically closed after 3 days. New replies are no longer allowed.