Some queries that don't finish

I have three queries that have have been running for over 1 hour at 100% CPU.

I identified them with:

sudo -u postgres psql -c "SELECT (now() - xact_start) as t, pid, query FROM pg_stat_activity WHERE state IN ('idle in transaction', 'active') order by t desc;";

And got the output below. I rebuilt earlier today. It’s not an urgent problem as we have a 12 core server and this keeps 1 core occupied full time, worth looking in to though.

       t        |  pid  |                                                                query                                                                 
-----------------+-------+--------------------------------------------------------------------------------------------------------------------------------------
 01:02:34.648308 |  1455 | SELECT                                                                                                                              +
                 |       |   posts.id                                                                                                                          +
                 |       | FROM posts                                                                                                                          +
                 |       | LEFT JOIN post_search_data pd                                                                                                       +
                 |       |   ON pd.locale = 'en'                                                                                                               +
                 |       |   AND pd.version = 2                                                                                                                +
                 |       |   AND pd.post_id = posts.id                                                                                                         +
                 |       | LEFT JOIN topics ON topics.id = posts.topic_id                                                                                      +
                 |       | WHERE pd.post_id IS NULL                                                                                                            +
                 |       | AND topics.id IS NOT NULL                                                                                                           +
                 |       | AND topics.deleted_at IS NULL                                                                                                       +
                 |       | AND posts.raw != ''                                                                                                                 +
                 |       | ORDER BY posts.id DESC                                                                                                              +
                 |       | LIMIT 20000                                                                                                                         +
                 |       | 
 01:02:34.648308 |  1461 | SELECT                                                                                                                              +
                 |       |   posts.id                                                                                                                          +
                 |       | FROM posts                                                                                                                          +
                 |       | LEFT JOIN post_search_data pd                                                                                                       +
                 |       |   ON pd.locale = 'en'                                                                                                               +
                 |       |   AND pd.version = 2                                                                                                                +
                 |       |   AND pd.post_id = posts.id                                                                                                         +
                 |       | LEFT JOIN topics ON topics.id = posts.topic_id                                                                                      +
                 |       | WHERE pd.post_id IS NULL                                                                                                            +
                 |       | AND topics.id IS NOT NULL                                                                                                           +
                 |       | AND topics.deleted_at IS NULL                                                                                                       +
                 |       | AND posts.raw != ''                                                                                                                 +
                 |       | ORDER BY posts.id DESC                                                                                                              +
                 |       | LIMIT 20000                                                                                                                         +
                 |       | 
 01:02:34.648308 |  1462 | SELECT                                                                                                                              +
                 |       |   posts.id                                                                                                                          +
                 |       | FROM posts                                                                                                                          +
                 |       | LEFT JOIN post_search_data pd                                                                                                       +
                 |       |   ON pd.locale = 'en'                                                                                                               +
                 |       |   AND pd.version = 2                                                                                                                +
                 |       |   AND pd.post_id = posts.id                                                                                                         +
                 |       | LEFT JOIN topics ON topics.id = posts.topic_id                                                                                      +
                 |       | WHERE pd.post_id IS NULL                                                                                                            +
                 |       | AND topics.id IS NOT NULL                                                                                                           +
                 |       | AND topics.deleted_at IS NULL                                                                                                       +
                 |       | AND posts.raw != ''                                                                                                                 +
                 |       | ORDER BY posts.id DESC                                                                                                              +
                 |       | LIMIT 20000                          
2 Likes

The problem is slowly getting worse - the original queries are still running and almost exactly 2 hours later, 3 more started. They’re now occupying 2/12 of my cores. If this continues overnight it will turn into an actual problem.

Update: this query was introduced by this recent commit. Perhaps it’s missing an index on some tables? I have 4.1M records in the posts and post_search_data tables and 1.5M in the topics table, so the double LEFT JOIN could be a problem if something is not indexed correctly.

Not sure if this helps, but here’s an EXPLAIN of the query on my DB:

sudo -u postgres psql discourse -c "EXPLAIN SELECT posts.id FROM posts LEFT JOIN post_search_data pd ON pd.locale='en' AND pd.version=2 AND pd.post_id=posts.id LEFT JOIN topics ON topics.id=posts.topic_id WHERE pd.post_id IS NULL AND topics.id IS NOT NULL AND topics.deleted_at IS NULL AND posts.raw!='' ORDER BY posts.id DESC LIMIT 20000;";
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2000.88..124179.27 rows=20000 width=4)
   ->  Nested Loop Anti Join  (cost=2000.88..29146177.32 rows=4770758 width=4)
         Join Filter: (pd.post_id = posts.id)
         ->  Gather Merge  (cost=1000.88..21351547.22 rows=4770860 width=4)
               Workers Planned: 2
               ->  Nested Loop  (cost=0.86..20799871.57 rows=1987858 width=4)
                     ->  Parallel Index Scan Backward using posts_pkey on posts  (cost=0.43..12078793.28 rows=1988522 width=8)
                           Filter: (raw <> ''::text)
                     ->  Index Only Scan using index_topics_on_id_and_deleted_at on topics  (cost=0.43..4.39 rows=1 width=4)
                           Index Cond: ((id = posts.topic_id) AND (id IS NOT NULL) AND (deleted_at IS NULL))
         ->  Materialize  (cost=1000.00..495214.55 rows=102 width=4)
               ->  Gather  (cost=1000.00..495214.04 rows=102 width=4)
                     Workers Planned: 2
                     ->  Parallel Seq Scan on post_search_data pd  (cost=0.00..494203.84 rows=42 width=4)
                           Filter: (((locale)::text = 'en'::text) AND (version = 2))
1 Like

This query will get posts who don’t have search data, so we can fix those.

It runs in less than 1 second in Meta, using a very weak database.

Why do you have so many posts without search data? Maybe an import left you in a bad state?

1 Like

I’m not aware that we do - we have never had problems with search. How can I see if a search is missing search data?

That is possible, but again: we’ve been using Discourse after migrating from vBulletin for almost a year now without a single problem with search. This issue started after I updated today.

1 Like

I ran the previous version of this query, which only took about 1 second on my system too:

SELECT p2.id
FROM posts p2
LEFT JOIN post_search_data pd ON pd.locale = 'en' 
AND pd.version = 2 
AND p2.id = pd.post_id
WHERE pd.post_id IS NULL

This returned 3139 rows.

1 Like

Hmmm, the query was changed to be more restrict, and return less stuff to re-index.

Can you run an EXPLAIN ANALYZE of the query?

2 Likes

I’m running it now, but it’s already taking several minutes - is that expected?

2 Likes

Oh yeah, it may not work in your current state.

Cut down the limit, like:

explain analyze  SELECT
          posts.id
        FROM posts
        LEFT JOIN post_search_data pd
          ON pd.locale = 'en'
          AND pd.version = 2
          AND pd.post_id = posts.id
        LEFT JOIN topics ON topics.id = posts.topic_id
        WHERE pd.post_id IS NULL
        AND topics.id IS NOT NULL
        AND topics.deleted_at IS NULL
        AND posts.raw != ''
        ORDER BY posts.id DESC
        LIMIT 200;
1 Like

Great, that finished very fast:

sudo -u postgres psql discourse -c "EXPLAIN ANALYZE SELECT posts.id FROM posts LEFT JOIN post_search_data pd ON pd.locale='en' AND pd.version=2 AND pd.post_id=posts.id LEFT JOIN topics ON topics.id=posts.topic_id WHERE pd.post_id IS NULL AND topics.id IS NOT NULL AND topics.deleted_at IS NULL AND posts.raw!='' ORDER BY posts.id DESC LIMIT 200;";
                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.29..2470.34 rows=200 width=4) (actual time=2.318..46.698 rows=200 loops=1)
   ->  Nested Loop Anti Join  (cost=1.29..58896117.57 rows=4770758 width=4) (actual time=2.317..46.664 rows=200 loops=1)
         ->  Nested Loop  (cost=0.86..33044192.61 rows=4770860 width=4) (actual time=0.132..39.384 rows=2238 loops=1)
               ->  Index Scan Backward using posts_pkey on posts  (cost=0.43..12113605.04 rows=4772452 width=8) (actual time=0.075..27.719 rows=2239 loops=1)
                     Filter: (raw <> ''::text)
                     Rows Removed by Filter: 9
               ->  Index Only Scan using index_topics_on_id_and_deleted_at on topics  (cost=0.43..4.39 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=2239)
                     Index Cond: ((id = posts.topic_id) AND (id IS NOT NULL) AND (deleted_at IS NULL))
                     Heap Fetches: 2106
         ->  Index Scan using posts_search_pkey on post_search_data pd  (cost=0.43..5.42 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=2238)
               Index Cond: (post_id = posts.id)
               Filter: (((locale)::text = 'en'::text) AND (version = 2))
               Rows Removed by Filter: 0
 Planning time: 1.923 ms
 Execution time: 46.760 ms
(15 rows)
1 Like

That’s strange.

Can you push that limit to 500? Maybe 1000?

Also, you can kill the old running queries and run a vacuum over posts, topics and post_search_data ?

2 Likes

Once I push it to 500 it’s taking a very long time again (still not finished).

I need a little guidance on killing the processes and doing the vacuum, sorry :slight_smile:

1 Like

The OP has the “stuck” queries with the PIDs of each one.

You can use SELECT pg_cancel_backend(__pid__); to ask the query to stop nicely, and SELECT pg_terminate_backend(__pid__); to ask it to stop NOW. (where __pid__ is a number)

For vacuum:

VACUUM VERBOSE ANALYZE posts;
VACUUM VERBOSE ANALYZE topics;
VACUUM VERBOSE ANALYZE  post_search_data;
3 Likes

Ok done, all the dusty corners of my DB have been vacuumed!

The EXPLAIN ANALYZE was a TON faster now, I could even run it with the 20,000 limit. Perhaps the vacuum did the trick?

sudo -u postgres psql discourse -c "EXPLAIN ANALYZE SELECT posts.id FROM posts LEFT JOIN post_search_data pd ON pd.locale='en' AND pd.version=2 AND pd.post_id=posts.id LEFT JOIN topics ON topics.id=posts.topic_id WHERE pd.post_id IS NULL AND topics.id IS NOT NULL AND topics.deleted_at IS NULL AND posts.raw!='' ORDER BY posts.id DESC LIMIT 20000;";
                                                                             QUERY PLAN                                                                              
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1440033.74..1440033.75 rows=1 width=4) (actual time=4549.789..4550.293 rows=3077 loops=1)
   ->  Sort  (cost=1440033.74..1440033.75 rows=1 width=4) (actual time=4549.788..4550.104 rows=3077 loops=1)
         Sort Key: posts.id DESC
         Sort Method: quicksort  Memory: 241kB
         ->  Nested Loop  (cost=599461.35..1440033.73 rows=1 width=4) (actual time=2535.778..4548.345 rows=3077 loops=1)
               ->  Gather  (cost=599460.92..1440030.82 rows=1 width=8) (actual time=2535.705..4734.830 rows=3118 loops=1)
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Hash Anti Join  (cost=598460.92..1439030.72 rows=1 width=8) (actual time=2551.745..4510.171 rows=1039 loops=3)
                           Hash Cond: (posts.id = pd.post_id)
                           ->  Parallel Seq Scan on posts  (cost=0.00..798106.15 rows=1718754 width=8) (actual time=0.022..810.363 rows=1377851 loops=3)
                                 Filter: (raw <> ''::text)
                                 Rows Removed by Filter: 169
                           ->  Hash  (cost=530651.93..530651.93 rows=4133119 width=4) (actual time=2535.564..2535.564 rows=4130922 loops=3)
                                 Buckets: 1048576  Batches: 8  Memory Usage: 26319kB
                                 ->  Seq Scan on post_search_data pd  (cost=0.00..530651.93 rows=4133119 width=4) (actual time=0.037..1846.142 rows=4130922 loops=3)
                                       Filter: (((locale)::text = 'en'::text) AND (version = 2))
                                       Rows Removed by Filter: 3104
               ->  Index Only Scan using index_topics_on_id_and_deleted_at on topics  (cost=0.43..2.91 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=3118)
                     Index Cond: ((id = posts.topic_id) AND (id IS NOT NULL) AND (deleted_at IS NULL))
                     Heap Fetches: 709
 Planning time: 2.687 ms
 Execution time: 4750.903 ms
(23 rows)
1 Like

EXPLAIN ANALYZE actually runs the query, so you can for sure just run it now :wink:.

Looks like the poor database was missing some care. :face_with_thermometer:

3 Likes

Interesting - I didn’t realise I needed to do any of those actions as a Discourse admin. Is there a HOWTO about this, by any chance?

1 Like

PostgreSQL runs vacuum automatically in the background.

In normal usage, we never need to manually go and interfere on that.

However, imports, restoring backups and etc, can generate a lot of cruft in the database, and if the resources aren’t adequate it can generate garbage faster than the autovacuum can clean it out.

6 Likes

That’s good to know. My server isn’t overloaded, so I’d be surprised if I didn’t have enough resources - I usually have a few cores idling, plenty of free RAM and SSD drives. Is there a way to check if postgres is trying to run these vacuum jobs, but fails for some reason?

By the way, I don’t want to keep on taking your time - you’ve been a life saver! So big hug from the Netherlands and don’t worry about my last question here if you want to move on to other things :slight_smile:

4 Likes