PostAlerter job hangs/OOMs as of late

Ever since FEATURE: new watched_precedence_over_muted setting (#22252) · discourse/discourse@9cf981f · GitHub (presumably!) we've been seeing our Sidekiq queues fill up and OOMs caused by stuck PostAlert jobs.
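
For reference, the backlog is easy to check from a Rails console with Sidekiq's public API (a rough sketch; Jobs::PostAlert is the class name Discourse uses for this job):

require "sidekiq/api"

# Per-queue backlog, plus how many of the queued jobs are PostAlert.
# Iterating a very large queue can take a while, so trim as needed.
Sidekiq::Queue.all.each do |queue|
  post_alerts = queue.count { |job| job.klass == "Jobs::PostAlert" }
  puts "#{queue.name}: #{queue.size} enqueued (#{post_alerts} PostAlert)"
end

puts "total enqueued: #{Sidekiq::Stats.new.enqueued}"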

Given that this commit has already had a similar regression (FIX: error when CategoryList tried to find relevant topics by lis2 · Pull Request #22339 · discourse/discourse · GitHub), it's fairly suspect. We'll try downgrading to a commit before it now and will report back with the results.

Downgrading to 4f7f9ef87cbcd574144f657dd43b7e705d98ff8e indeed resolved the stuck PostAlert Sidekiq workers and the OOM concerns: the few PostAlert jobs that do get queued now complete within a matter of seconds, not minutes.

Ping @kris.kotlarek

I just saw FIX: improve performance of post alerter job (#22378) · discourse/discourse@7a204e7 · GitHub get pushed; we'll try it out later if it's meant to be a fix for this issue.

Hey, yes, this should hopefully solve the issue. I tested this solution on three forums, and the new query was always much faster and didn't clog the servers.
Thank you for reporting this issue, and please let me know if you still experience problems.

Sadly, even with these changes, we’re still seeing PostAlert jobs take way longer than before and hang the Sidekiq workers entirely while processing. :frowning:

(We have 10M+ user rows and some categories muted by default, so there are a lot of mutes set up!)
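
For a sense of that scale, a rough sketch like this (run from a Rails console, using the same notification-level constants as the job's query) counts the muted rows involved:

# How many muted category/tag preferences and precedence overrides exist
# overall. These are the rows the new PostAlert query has to wade through.
puts CategoryUser.where(notification_level: CategoryUser.notification_levels[:muted]).count
puts TagUser.where(notification_level: TagUser.notification_levels[:muted]).count
puts UserOption.where(watched_precedence_over_muted: true).count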

Reverting the three commits that recently touched this job and restarting the container, meanwhile, lets the jobs complete fine.

Thank you, I will have another look.

I just merged another attempt to improve the performance of this job:

I tested it on a few instances and it was fine, but they were smaller than yours.

If it is still failing with this commit, could you provide me with an EXPLAIN ANALYZE report? Here is a script to generate it:

# Rebuild the muted-users query used by the post alerter for the most
# recent topic, then run EXPLAIN ANALYZE on it.
topic = Topic.last

user_option_sql_fragment =
  if SiteSetting.watched_precedence_over_muted
    <<~SQL
    INTERSECT
    SELECT user_id FROM user_options WHERE user_options.watched_precedence_over_muted IS false
    SQL
  else
    <<~SQL
    EXCEPT
    SELECT user_id FROM user_options WHERE user_options.watched_precedence_over_muted IS true
    SQL
  end

user_ids_sql = <<~SQL
  (
    SELECT user_id FROM category_users WHERE category_id = #{topic.category_id.to_i} AND notification_level = #{CategoryUser.notification_levels[:muted]}
    UNION
    SELECT user_id FROM tag_users tu JOIN topic_tags tt ON tt.tag_id = tu.tag_id AND tt.topic_id = #{topic.id} WHERE tu.notification_level = #{TagUser.notification_levels[:muted]}
    EXCEPT
    SELECT user_id FROM topic_users tus WHERE tus.topic_id = #{topic.id} AND tus.notification_level = #{TopicUser.notification_levels[:watching]}
  )
  #{user_option_sql_fragment}
SQL

sql = User.where("id IN (#{user_ids_sql})").to_sql

explain_sql = <<SQL
EXPLAIN ANALYZE #{sql};
SQL

# Print the query followed by its plan so both can be pasted back here.
result = ActiveRecord::Base.connection.execute(explain_sql)
puts explain_sql
result.each do |r|
  puts r.values
end

It would help me find a missing index or identify which part of this query is so slow.
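
If it helps cross-check the plan, the indexes that already exist on the tables the query touches can be listed with a quick sketch like this (plain ActiveRecord schema introspection, nothing Discourse-specific):

# List existing indexes on the tables referenced by the query above,
# to compare against whatever EXPLAIN ANALYZE ends up scanning.
%w[category_users tag_users topic_tags topic_users user_options].each do |table|
  puts table
  ActiveRecord::Base.connection.indexes(table).each do |index|
    puts "  #{index.name} (#{Array(index.columns).join(', ')})"
  end
end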

This change seems to have returned PostAlert jobs to their normal duration, and they no longer hang the Sidekiq instances. Yay!

This topic was automatically closed after 3 days. New replies are no longer allowed.