Long-Running Sidekiq Jobs

I have two jobs from Sidekiq that seem to be taking a long time to complete. It looks like the same overall process but there are two jobs listing it in the RUNNING status. These jobs have been running for 6 hours now, and postgres is hanging on the CLI when I attempt to even run an EXPLAIN ANALYZE against the first of these queries.

Any ideas on what would cause these queries to take so long to execute?

3 Likes

Feel free to kill those queries a s let the system retry later.

5 Likes

I had done that earlier, actually. I killed the job to let a reindex finish, so this was it restarting after that.

It looks like this morning the jobs finished sometime overnight and are running again. Current status:

This subquery returns around 13,000 rows on our instance:

SELECT ids.user_id, q.post_id, p3.created_at granted_at                     
                   FROM                                                                              
                   (                                                                                 
                     SELECT p1.user_id, MIN(q1.id) id                                                
                     FROM quoted_posts q1                                                            
                     JOIN badge_posts p1 ON p1.id = q1.post_id                                       
                     JOIN badge_posts p2 ON p2.id = q1.quoted_post_id                                
                     WHERE (TRUE OR ( p1.id IN (-1) ))                                               
                     GROUP BY p1.user_id                                                             
                   ) ids                                                                             
                   JOIN quoted_posts q ON q.id = ids.id                                              
                   JOIN badge_posts p3 ON q.post_id = p3.id

It’s then being LEFT JOINED to the user_badges table, which has 84,000 rows. It seems like something in the last condition WHERE ub.badge_id = 15 AND q.user_id IS NULL is blowing up this query. If I leave off the WHERE clause then it will execute in a reasonable amount of time (20 seconds or so), but if I include even just the WHERE ub.badge_id = 15, then I cannot even get an EXPLAIN to execute against this query in a reasonable amount of time. The EXPLAIN has been hanging for several minutes now with no results. The actual execution of the full query has been running for hours. Is there anything we can do to optimize this query?

2 Likes

From reading topics since last night here on Meta, it seems like no job should be running for 8+ Hours, particularly for a database outsize.

But I’m not sure what we can further do to improve it.

It’s crazy to me they we can’t get EXPLAIN to show anything because it hangs however.

1 Like

Looking at this file: https://github.com/discourse/discourse/blob/master/app/services/badge_granter.rb

There’s the following code that runs this query that is currently stuck. If I change the first join from LEFT JOIN to INNER JOIN the query executes instantly. Is there a reason this needs to be a left join?

    sql = <<~SQL
      DELETE FROM user_badges
        WHERE id IN (
          SELECT ub.id
          FROM user_badges ub
          LEFT JOIN (
            #{badge.query}
          ) q ON q.user_id = ub.user_id
          #{post_clause}
          WHERE ub.badge_id = :id AND q.user_id IS NULL
        )
    SQL
1 Like

@Falco

Is it possible to make these badge queries faster?

Another query that looks bad is this one, I think from the weekly cleanup job:

UPDATE posts                                                                                                                
                   SET percent_rank = X.percent_rank                                                                                           
                   FROM (                                                                                                                      
                     SELECT posts.id, Y.percent_rank                                                                                           
                     FROM posts                                                                                                                
                     JOIN (                                                                                                                    
                       SELECT id, percent_rank()                                                                                               
                                    OVER (PARTITION BY topic_id ORDER BY SCORE DESC) as percent_rank                                           
                       FROM posts                                                                                                              
                      ) Y ON Y.id = posts.id                                                                                                   
                      JOIN topics ON posts.topic_id = topics.id                                                                                
                     WHERE (posts.percent_rank IS NULL OR Y.percent_rank <> posts.percent_rank)                                                
                     LIMIT 20000                                                                                                               
                   ) AS X                                                                                                                      
                   WHERE posts.id = X.id

The explain for this shows it’s trying to sort all 26 million rows in the posts table. I can’t tell what method it will use for this query, but based on the fact that the active wait is “DataFileRead” I think it is going out to disk for something…

                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Update on posts  (cost=8312704.61..8627308.35 rows=20000 width=825)
   ->  Nested Loop  (cost=8312704.61..8627308.35 rows=20000 width=825)
         ->  Subquery Scan on x  (cost=8312704.18..8464468.35 rows=20000 width=48)
               ->  Limit  (cost=8312704.18..8464268.35 rows=20000 width=12)
                     ->  Hash Join  (cost=8312704.18..209445240.14 rows=26540908 width=12)
                           Hash Cond: (posts_1.topic_id = topics.id)
                           ->  Nested Loop  (cost=8277347.60..209340213.36 rows=26540908 width=16)
                                 ->  WindowAgg  (cost=8277347.16..8809352.84 rows=26600284 width=24)
                                       ->  Sort  (cost=8277347.16..8343847.87 rows=26600284 width=16)
                                             Sort Key: posts_2.topic_id, posts_2.score DESC
                                             ->  Seq Scan on posts posts_2  (cost=0.00..4542277.84 rows=26600284 width=16)
                                 ->  Index Scan using posts_pkey on posts posts_1  (cost=0.44..7.52 rows=1 width=16)
                                       Index Cond: (id = posts_2.id)
                                       Filter: ((percent_rank IS NULL) OR ((percent_rank() OVER (?)) <> percent_rank))
                           ->  Hash  (cost=23871.05..23871.05 rows=918842 width=4)
                                 ->  Index Only Scan using topics_pkey on topics  (cost=0.42..23871.05 rows=918842 width=4)
         ->  Index Scan using posts_pkey on posts  (cost=0.44..8.14 rows=1 width=781)
               Index Cond: (id = x.id)
 JIT:
   Functions: 24
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(21 rows)

I have a strong feeling that you are swimming upstream here and the DB you are using simply does not have enough resources to run Discourse.

  • What are the exact specs of the metal (CPU / Hard drive make and model) ?
  • What is the exact spec of the VM ?

These queries are indeed expensive, but we host plenty large forums (eg: About - Straight Dope Message Board 22 million posts) and we are able to run all these queries just fine on that instance.

1 Like

It’s a dedicated server with the following:

AMD Ryzen 7 3800X
64 GB ECC RAM @ 2666 Mhz
2 x 1.2 TB Intel P3600 NVMe SSD (ZFS RAID 1)

The VM running Discourse has been assigned 8 CPU cores and 32 GB RAM.

I believe I have discovered the issue with the first query, or at least a way to instruct the query planner to make the right decision. Here’s the query that would not complete in 16+ hours (this is for the First Quote badge):

SELECT ub.id                                                                  
                       FROM user_badges ub                                                           
                       LEFT JOIN (                                                                   
                         SELECT ids.user_id, q.post_id, p3.created_at granted_at                     
                   FROM                                                                              
                   (                                                                                 
                     SELECT p1.user_id, MIN(q1.id) id                                                
                     FROM quoted_posts q1                                                            
                     JOIN badge_posts p1 ON p1.id = q1.post_id                                       
                     JOIN badge_posts p2 ON p2.id = q1.quoted_post_id                                
                     WHERE (TRUE OR ( p1.id IN (-1) ))                                               
                     GROUP BY p1.user_id                                                             
                   ) ids                                                                             
                   JOIN quoted_posts q ON q.id = ids.id                                              
                   JOIN badge_posts p3 ON q.post_id = p3.id                                          
                                                                                                     
                       ) q ON q.user_id = ub.user_id                                                 
                       AND (q.post_id = ub.post_id OR NOT TRUE)                                      
                       WHERE ub.badge_id = 15 AND q.user_id IS NULL

If I add a single ORDER BY line in the right location, this query now completes in a few seconds:

SELECT ub.id                                                                  
                       FROM user_badges ub                                                           
                       LEFT JOIN (                                                                   
                         SELECT ids.user_id, q.post_id, p3.created_at granted_at                     
                   FROM                                                                              
                   (                                                                                 
                     SELECT p1.user_id, MIN(q1.id) id                                                
                     FROM quoted_posts q1                                                            
                     JOIN badge_posts p1 ON p1.id = q1.post_id                                       
                     JOIN badge_posts p2 ON p2.id = q1.quoted_post_id                                
                     WHERE (TRUE OR ( p1.id IN (-1) ))                                               
                     GROUP BY p1.user_id                                                             
                   ) ids                                                                             
                   JOIN quoted_posts q ON q.id = ids.id                                              
                   JOIN badge_posts p3 ON q.post_id = p3.id
                   ORDER BY ids.user_id                                          
                                                                                                     
                       ) q ON q.user_id = ub.user_id                                                 
                       AND (q.post_id = ub.post_id OR NOT TRUE)                                      
                       WHERE ub.badge_id = 15 AND q.user_id IS NULL

I feel like it should be smart enough to do this sort in the right place, but it seems it’s not… Still, the fix seems rather easy at this point.

Haven’t really started digging into the other query on the percent_rank yet.

2 Likes

Some times planning is off when statistics are bad … in some exceptional cases a full vacuum can help, a minimal vacuum is totally recommended post imports. I think you did both.

Is there a reason you are running in a VM vs Docker directly on the host?

1 Like

What are the resources allocated to Straight dope if I may ask and does it have any jobs that takes several hours long like we do at 27 million

Yes, I’ve run VACUUM ANALYZE a few times. The statistics should be correct, but this seems to be choosing poorly across multiple rebuilds, Postgres tuning adjustments, and vacuums.

We’re running other VMs on this host machine but we have the spare resources at the moment so this is where I’ve built a system to test Discourse.

From our large instance looking at: /sidekiq/scheduler

And

Did you do a FULL VACUUM?

Our DB server is ballpark similar hardware performance to yours (though we have faster IO given we have a larger RAID array.) However we do not run virtualized at all. That is one big difference.

2 Likes

I have not. I can give that a try and see if the behavior is any different.

I’m sure there’s some performance loss from running in a VM, but there’s nothing clobbering the hardware heavily. When I ran the import to bring all our data in from our other software, I was able to get 60-70% usage out of all 8 cores running multiple import processes simultaneously.
When these jobs are sitting and spinning now I usually don’t see the load average above 2-3 so they are not even using all the CPU that’s available.

2 Likes

The full vacuum is something I have seen help after massive migrations, curious to see if it has an impact

3 Likes

It is currently running.

2 Likes

On the percent_rank query that looks to be part of the weekly job, what does the EXPLAIN look like for your large instance? My instance is reporting a query cost of over 8 million on that one, which seems a bit scary.
Do you have any recommendations for the postgres tuning values in app.yml? Right now I’m using:

shared_buffers: 16GB
work_mem: 512MB

The VACUUM FULL completed but it doesn’t seem to have made a difference in the query performance. The badge query still seems like it will run for hours and hours unless I add the ORDER BY clause, and the percent_rank query has been running for two hours now without completing. We will probably need to change the SQL definition for the First Quote badge and then I’ll need to look at what can be done to fix the percent_rank query.

Do you have any suggestions for the percent_rank query based on this EXPLAIN?

UPDATE posts                                                                      
                   SET percent_rank = X.percent_rank                                                 
                   FROM (                                                                            
                     SELECT posts.id, Y.percent_rank                                                 
                     FROM posts                                                                      
                     JOIN (                                                                          
                       SELECT id, percent_rank()                                                     
                                    OVER (PARTITION BY topic_id ORDER BY SCORE DESC) as percent_rank 
                       FROM posts                                                                    
                      ) Y ON Y.id = posts.id                                                         
                      JOIN topics ON posts.topic_id = topics.id                                      
                     WHERE (posts.percent_rank IS NULL OR Y.percent_rank <> posts.percent_rank)      
                     LIMIT 20000                                                                     
                   ) AS X                                                                            
                   WHERE posts.id = X.id

                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Update on posts  (cost=6511439.82..6944253.09 rows=20000 width=828)
   ->  Nested Loop  (cost=6511439.82..6944253.09 rows=20000 width=828)
         ->  Subquery Scan on x  (cost=6511439.38..6784765.09 rows=20000 width=48)
               ->  Limit  (cost=6511439.38..6784565.09 rows=20000 width=12)
                     ->  Nested Loop  (cost=6511439.38..374544016.70 rows=26949684 width=12)
                           ->  Nested Loop  (cost=6511438.96..192122439.64 rows=26949684 width=16)
                                 ->  WindowAgg  (cost=6511438.52..7050906.24 rows=26973386 width=24)
                                       ->  Sort  (cost=6511438.52..6578871.98 rows=26973386 width=16)
                                             Sort Key: posts_2.topic_id, posts_2.score DESC
                                             ->  Seq Scan on posts posts_2  (cost=0.00..2721272.86 rows=26973386 width=16)
                                 ->  Index Scan using posts_pkey on posts posts_1  (cost=0.44..6.84 rows=1 width=16)
                                       Index Cond: (id = posts_2.id)
                                       Filter: ((percent_rank IS NULL) OR ((percent_rank() OVER (?)) <> percent_rank))
                           ->  Index Only Scan using topics_pkey on topics  (cost=0.42..6.77 rows=1 width=4)
                                 Index Cond: (id = posts_1.topic_id)
         ->  Index Scan using posts_pkey on posts  (cost=0.44..7.97 rows=1 width=784)
               Index Cond: (id = x.id)
 JIT:
   Functions: 21
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(20 rows)
1 Like

You could try playing with the limit, maybe at limit 1000 this will be fast enough for you.

Changing the limit doesn’t seem to alter the query plan much (costs or otherwise). The issue seems to be that the query has to sort the entire posts table (which in our case is about 26.5 million rows) before it can do the operation. There might an opportunity for an index here. I don’t see the score column included in any of the indexes on the posts table right now.

The ranking is per topic, it is not ranking the whole set.

You could perhaps gate on topic ids … WHERE topic_id < 1000 … 2000 … 10000 and so on… Probably once the initial update is done this will run faster.

1 Like