The problem is slowly getting worse - the original queries are still running and almost exactly 2 hours later, 3 more started. They’re now occupying 2/12 of my cores. If this continues overnight it will turn into an actual problem.
Update: this query was introduced by this recent commit. Perhaps it’s missing an index on some tables? I have 4.1M records in the posts and post_search_data tables and 1.5M in the topics table, so the double LEFT JOIN could be a problem if something is not indexed correctly.
Not sure if this helps, but here’s an EXPLAIN of the query on my DB:
sudo -u postgres psql discourse -c "EXPLAIN SELECT posts.id FROM posts LEFT JOIN post_search_data pd ON pd.locale='en' AND pd.version=2 AND pd.post_id=posts.id LEFT JOIN topics ON topics.id=posts.topic_id WHERE pd.post_id IS NULL AND topics.id IS NOT NULL AND topics.deleted_at IS NULL AND posts.raw!='' ORDER BY posts.id DESC LIMIT 20000;";
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2000.88..124179.27 rows=20000 width=4)
-> Nested Loop Anti Join (cost=2000.88..29146177.32 rows=4770758 width=4)
Join Filter: (pd.post_id = posts.id)
-> Gather Merge (cost=1000.88..21351547.22 rows=4770860 width=4)
Workers Planned: 2
-> Nested Loop (cost=0.86..20799871.57 rows=1987858 width=4)
-> Parallel Index Scan Backward using posts_pkey on posts (cost=0.43..12078793.28 rows=1988522 width=8)
Filter: (raw <> ''::text)
-> Index Only Scan using index_topics_on_id_and_deleted_at on topics (cost=0.43..4.39 rows=1 width=4)
Index Cond: ((id = posts.topic_id) AND (id IS NOT NULL) AND (deleted_at IS NULL))
-> Materialize (cost=1000.00..495214.55 rows=102 width=4)
-> Gather (cost=1000.00..495214.04 rows=102 width=4)
Workers Planned: 2
-> Parallel Seq Scan on post_search_data pd (cost=0.00..494203.84 rows=42 width=4)
Filter: (((locale)::text = 'en'::text) AND (version = 2))