Some queries that don't finish

I have three queries that have have been running for over 1 hour at 100% CPU.

I identified them with:

sudo -u postgres psql -c "SELECT (now() - xact_start) as t, pid, query FROM pg_stat_activity WHERE state IN ('idle in transaction', 'active') order by t desc;";

And got the output below. I rebuilt earlier today. It’s not an urgent problem as we have a 12 core server and this keeps 1 core occupied full time, worth looking in to though.

       t        |  pid  |                                                                query                                                                 
-----------------+-------+--------------------------------------------------------------------------------------------------------------------------------------
 01:02:34.648308 |  1455 | SELECT                                                                                                                              +
                 |       |   posts.id                                                                                                                          +
                 |       | FROM posts                                                                                                                          +
                 |       | LEFT JOIN post_search_data pd                                                                                                       +
                 |       |   ON pd.locale = 'en'                                                                                                               +
                 |       |   AND pd.version = 2                                                                                                                +
                 |       |   AND pd.post_id = posts.id                                                                                                         +
                 |       | LEFT JOIN topics ON topics.id = posts.topic_id                                                                                      +
                 |       | WHERE pd.post_id IS NULL                                                                                                            +
                 |       | AND topics.id IS NOT NULL                                                                                                           +
                 |       | AND topics.deleted_at IS NULL                                                                                                       +
                 |       | AND posts.raw != ''                                                                                                                 +
                 |       | ORDER BY posts.id DESC                                                                                                              +
                 |       | LIMIT 20000                                                                                                                         +
                 |       | 
 01:02:34.648308 |  1461 | SELECT                                                                                                                              +
                 |       |   posts.id                                                                                                                          +
                 |       | FROM posts                                                                                                                          +
                 |       | LEFT JOIN post_search_data pd                                                                                                       +
                 |       |   ON pd.locale = 'en'                                                                                                               +
                 |       |   AND pd.version = 2                                                                                                                +
                 |       |   AND pd.post_id = posts.id                                                                                                         +
                 |       | LEFT JOIN topics ON topics.id = posts.topic_id                                                                                      +
                 |       | WHERE pd.post_id IS NULL                                                                                                            +
                 |       | AND topics.id IS NOT NULL                                                                                                           +
                 |       | AND topics.deleted_at IS NULL                                                                                                       +
                 |       | AND posts.raw != ''                                                                                                                 +
                 |       | ORDER BY posts.id DESC                                                                                                              +
                 |       | LIMIT 20000                                                                                                                         +
                 |       | 
 01:02:34.648308 |  1462 | SELECT                                                                                                                              +
                 |       |   posts.id                                                                                                                          +
                 |       | FROM posts                                                                                                                          +
                 |       | LEFT JOIN post_search_data pd                                                                                                       +
                 |       |   ON pd.locale = 'en'                                                                                                               +
                 |       |   AND pd.version = 2                                                                                                                +
                 |       |   AND pd.post_id = posts.id                                                                                                         +
                 |       | LEFT JOIN topics ON topics.id = posts.topic_id                                                                                      +
                 |       | WHERE pd.post_id IS NULL                                                                                                            +
                 |       | AND topics.id IS NOT NULL                                                                                                           +
                 |       | AND topics.deleted_at IS NULL                                                                                                       +
                 |       | AND posts.raw != ''                                                                                                                 +
                 |       | ORDER BY posts.id DESC                                                                                                              +
                 |       | LIMIT 20000                          
إعجابَين (2)

The problem is slowly getting worse - the original queries are still running and almost exactly 2 hours later, 3 more started. They’re now occupying 2/12 of my cores. If this continues overnight it will turn into an actual problem.

Update: this query was introduced by this recent commit. Perhaps it’s missing an index on some tables? I have 4.1M records in the posts and post_search_data tables and 1.5M in the topics table, so the double LEFT JOIN could be a problem if something is not indexed correctly.

Not sure if this helps, but here’s an EXPLAIN of the query on my DB:

sudo -u postgres psql discourse -c "EXPLAIN SELECT posts.id FROM posts LEFT JOIN post_search_data pd ON pd.locale='en' AND pd.version=2 AND pd.post_id=posts.id LEFT JOIN topics ON topics.id=posts.topic_id WHERE pd.post_id IS NULL AND topics.id IS NOT NULL AND topics.deleted_at IS NULL AND posts.raw!='' ORDER BY posts.id DESC LIMIT 20000;";
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2000.88..124179.27 rows=20000 width=4)
   ->  Nested Loop Anti Join  (cost=2000.88..29146177.32 rows=4770758 width=4)
         Join Filter: (pd.post_id = posts.id)
         ->  Gather Merge  (cost=1000.88..21351547.22 rows=4770860 width=4)
               Workers Planned: 2
               ->  Nested Loop  (cost=0.86..20799871.57 rows=1987858 width=4)
                     ->  Parallel Index Scan Backward using posts_pkey on posts  (cost=0.43..12078793.28 rows=1988522 width=8)
                           Filter: (raw <> ''::text)
                     ->  Index Only Scan using index_topics_on_id_and_deleted_at on topics  (cost=0.43..4.39 rows=1 width=4)
                           Index Cond: ((id = posts.topic_id) AND (id IS NOT NULL) AND (deleted_at IS NULL))
         ->  Materialize  (cost=1000.00..495214.55 rows=102 width=4)
               ->  Gather  (cost=1000.00..495214.04 rows=102 width=4)
                     Workers Planned: 2
                     ->  Parallel Seq Scan on post_search_data pd  (cost=0.00..494203.84 rows=42 width=4)
                           Filter: (((locale)::text = 'en'::text) AND (version = 2))
إعجاب واحد (1)

This query will get posts who don’t have search data, so we can fix those.

It runs in less than 1 second in Meta, using a very weak database.

Why do you have so many posts without search data? Maybe an import left you in a bad state?

إعجاب واحد (1)

I’m not aware that we do - we have never had problems with search. How can I see if a search is missing search data?

That is possible, but again: we’ve been using Discourse after migrating from vBulletin for almost a year now without a single problem with search. This issue started after I updated today.

إعجاب واحد (1)

I ran the previous version of this query, which only took about 1 second on my system too:

SELECT p2.id
FROM posts p2
LEFT JOIN post_search_data pd ON pd.locale = 'en' 
AND pd.version = 2 
AND p2.id = pd.post_id
WHERE pd.post_id IS NULL

This returned 3139 rows.

إعجاب واحد (1)

Hmmm, the query was changed to be more restrict, and return less stuff to re-index.

Can you run an EXPLAIN ANALYZE of the query?

إعجابَين (2)

I’m running it now, but it’s already taking several minutes - is that expected?

إعجابَين (2)

Oh yeah, it may not work in your current state.

Cut down the limit, like:

explain analyze  SELECT
          posts.id
        FROM posts
        LEFT JOIN post_search_data pd
          ON pd.locale = 'en'
          AND pd.version = 2
          AND pd.post_id = posts.id
        LEFT JOIN topics ON topics.id = posts.topic_id
        WHERE pd.post_id IS NULL
        AND topics.id IS NOT NULL
        AND topics.deleted_at IS NULL
        AND posts.raw != ''
        ORDER BY posts.id DESC
        LIMIT 200;
إعجاب واحد (1)

Great, that finished very fast:

sudo -u postgres psql discourse -c "EXPLAIN ANALYZE SELECT posts.id FROM posts LEFT JOIN post_search_data pd ON pd.locale='en' AND pd.version=2 AND pd.post_id=posts.id LEFT JOIN topics ON topics.id=posts.topic_id WHERE pd.post_id IS NULL AND topics.id IS NOT NULL AND topics.deleted_at IS NULL AND posts.raw!='' ORDER BY posts.id DESC LIMIT 200;";
                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.29..2470.34 rows=200 width=4) (actual time=2.318..46.698 rows=200 loops=1)
   ->  Nested Loop Anti Join  (cost=1.29..58896117.57 rows=4770758 width=4) (actual time=2.317..46.664 rows=200 loops=1)
         ->  Nested Loop  (cost=0.86..33044192.61 rows=4770860 width=4) (actual time=0.132..39.384 rows=2238 loops=1)
               ->  Index Scan Backward using posts_pkey on posts  (cost=0.43..12113605.04 rows=4772452 width=8) (actual time=0.075..27.719 rows=2239 loops=1)
                     Filter: (raw <> ''::text)
                     Rows Removed by Filter: 9
               ->  Index Only Scan using index_topics_on_id_and_deleted_at on topics  (cost=0.43..4.39 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=2239)
                     Index Cond: ((id = posts.topic_id) AND (id IS NOT NULL) AND (deleted_at IS NULL))
                     Heap Fetches: 2106
         ->  Index Scan using posts_search_pkey on post_search_data pd  (cost=0.43..5.42 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=2238)
               Index Cond: (post_id = posts.id)
               Filter: (((locale)::text = 'en'::text) AND (version = 2))
               Rows Removed by Filter: 0
 Planning time: 1.923 ms
 Execution time: 46.760 ms
(15 rows)
إعجاب واحد (1)

That’s strange.

Can you push that limit to 500? Maybe 1000?

Also, you can kill the old running queries and run a vacuum over posts, topics and post_search_data ?

إعجابَين (2)

Once I push it to 500 it’s taking a very long time again (still not finished).

I need a little guidance on killing the processes and doing the vacuum, sorry :slight_smile:

إعجاب واحد (1)

The OP has the “stuck” queries with the PIDs of each one.

You can use SELECT pg_cancel_backend(__pid__); to ask the query to stop nicely, and SELECT pg_terminate_backend(__pid__); to ask it to stop NOW. (where __pid__ is a number)

For vacuum:

VACUUM VERBOSE ANALYZE posts;
VACUUM VERBOSE ANALYZE topics;
VACUUM VERBOSE ANALYZE  post_search_data;
3 إعجابات

Ok done, all the dusty corners of my DB have been vacuumed!

The EXPLAIN ANALYZE was a TON faster now, I could even run it with the 20,000 limit. Perhaps the vacuum did the trick?

sudo -u postgres psql discourse -c "EXPLAIN ANALYZE SELECT posts.id FROM posts LEFT JOIN post_search_data pd ON pd.locale='en' AND pd.version=2 AND pd.post_id=posts.id LEFT JOIN topics ON topics.id=posts.topic_id WHERE pd.post_id IS NULL AND topics.id IS NOT NULL AND topics.deleted_at IS NULL AND posts.raw!='' ORDER BY posts.id DESC LIMIT 20000;";
                                                                             QUERY PLAN                                                                              
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1440033.74..1440033.75 rows=1 width=4) (actual time=4549.789..4550.293 rows=3077 loops=1)
   ->  Sort  (cost=1440033.74..1440033.75 rows=1 width=4) (actual time=4549.788..4550.104 rows=3077 loops=1)
         Sort Key: posts.id DESC
         Sort Method: quicksort  Memory: 241kB
         ->  Nested Loop  (cost=599461.35..1440033.73 rows=1 width=4) (actual time=2535.778..4548.345 rows=3077 loops=1)
               ->  Gather  (cost=599460.92..1440030.82 rows=1 width=8) (actual time=2535.705..4734.830 rows=3118 loops=1)
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Hash Anti Join  (cost=598460.92..1439030.72 rows=1 width=8) (actual time=2551.745..4510.171 rows=1039 loops=3)
                           Hash Cond: (posts.id = pd.post_id)
                           ->  Parallel Seq Scan on posts  (cost=0.00..798106.15 rows=1718754 width=8) (actual time=0.022..810.363 rows=1377851 loops=3)
                                 Filter: (raw <> ''::text)
                                 Rows Removed by Filter: 169
                           ->  Hash  (cost=530651.93..530651.93 rows=4133119 width=4) (actual time=2535.564..2535.564 rows=4130922 loops=3)
                                 Buckets: 1048576  Batches: 8  Memory Usage: 26319kB
                                 ->  Seq Scan on post_search_data pd  (cost=0.00..530651.93 rows=4133119 width=4) (actual time=0.037..1846.142 rows=4130922 loops=3)
                                       Filter: (((locale)::text = 'en'::text) AND (version = 2))
                                       Rows Removed by Filter: 3104
               ->  Index Only Scan using index_topics_on_id_and_deleted_at on topics  (cost=0.43..2.91 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=3118)
                     Index Cond: ((id = posts.topic_id) AND (id IS NOT NULL) AND (deleted_at IS NULL))
                     Heap Fetches: 709
 Planning time: 2.687 ms
 Execution time: 4750.903 ms
(23 rows)
إعجاب واحد (1)

EXPLAIN ANALYZE actually runs the query, so you can for sure just run it now :wink:.

Looks like the poor database was missing some care. :face_with_thermometer:

3 إعجابات

Interesting - I didn’t realise I needed to do any of those actions as a Discourse admin. Is there a HOWTO about this, by any chance?

إعجاب واحد (1)

PostgreSQL runs vacuum automatically in the background.

In normal usage, we never need to manually go and interfere on that.

However, imports, restoring backups and etc, can generate a lot of cruft in the database, and if the resources aren’t adequate it can generate garbage faster than the autovacuum can clean it out.

6 إعجابات

That’s good to know. My server isn’t overloaded, so I’d be surprised if I didn’t have enough resources - I usually have a few cores idling, plenty of free RAM and SSD drives. Is there a way to check if postgres is trying to run these vacuum jobs, but fails for some reason?

By the way, I don’t want to keep on taking your time - you’ve been a life saver! So big hug from the Netherlands and don’t worry about my last question here if you want to move on to other things :slight_smile:

4 إعجابات