Post.calculate_avg_time() taking a long time


(Michael - DiscourseHosting.com) #1

On forums with a lot of posts, we see the daily CalculateAvgTime job taking a lot of time and even blocking other processes.

The source code comments already say

# PERF: these calculations can become exceedingly expensive
#  they run a huge geometric mean and are hard to optimise
#  defer to only run once a day

But they still don’t play nice.
Would it hurt if we cleaned up the post_timings table on a regular basis? Or are we missing something obvious here, and is there a better way to solve this?


(Jeff Atwood) #2

A question for @sam. I think all the recent optimizations for this landed in 1.6 though.


(Sam Saffron) #3

AFAIK this gets faster after it finishes the first run. Has a complete run ever finished?


(Michael - DiscourseHosting.com) #4

Yes, it runs every day without errors, it just takes a long time… On a large forum with pretty decent hardware it takes 2-4 minutes… I don’t see it getting faster (how would that work? Postgres doesn’t optimize *that* well for a single query?)


(Michael - DiscourseHosting.com) #5

BTW the issue is not just that it takes a while, but that it tends to block other processes…


(Jeff Atwood) #6

How large? How many posts? We’ve seen virtualized cloud database hardware take 3x longer than our bare metal database hardware.


(Michael - DiscourseHosting.com) #7

I’m on my phone now, will look up the numbers later. I estimate 1M posts and 80M post_timings.

EDIT my memory served me right… 1.24M posts and 85.8M post_timings

But to be honest, I don’t think 3x matters that much. This looks like an O(n*m) query… it will always cause trouble eventually… I’m more looking for a way to make it O(1), or to stop it from blocking the posts table.


(Matt Palmer) #8

When you say it’s blocking the posts table, which other queries, exactly, is it blocking? I’m not particularly familiar with the query under discussion, but depending on which queries are being blocked, we might be able to make it less intrusive (by not taking write locks, or checkpointing, or something).


(Mittineague) #9

I too am unfamiliar with this. My dev instances have a low number of posts so this problem has not surfaced.

If the code is determining geometric mean values for every topic I can see how this could get resource intensive.

Maybe it would help to include only those topics that have had posts edited or added since the last calculation?
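For reference, the per-post statistic the job computes is a geometric mean of read times, `round(exp(avg(ln(msecs))))` in the job's SQL. A minimal Python illustration of that formula, with made-up timings:

```python
import math

def geometric_mean_ms(msecs):
    """exp(mean(ln(x))): same shape as round(exp(avg(ln(msecs)))) in the SQL."""
    return round(math.exp(sum(math.log(m) for m in msecs) / len(msecs)))

# Two readers spent 1000 ms and 4000 ms on a post:
geometric_mean_ms([1000, 4000])  # 2000, vs. an arithmetic mean of 2500
```

The geometric mean damps outliers (someone leaving a tab open for an hour), which is presumably why it was chosen over a plain average, but it forces the database to touch every timing row for every post it aggregates.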


(Michael - DiscourseHosting.com) #10

I just caught another occurrence of this.

We have a number of queries being blocked that look like

UPDATE "posts" SET reads = reads + 1 WHERE ("posts"."deleted_at" IS NULL) AND (topic_id = 47 and post_number = 587)

They’re blocking each other, and eventually they are blocked by

UPDATE posts
SET avg_time = (x.gmean / 1000)
FROM (
  SELECT post_timings.topic_id,
         post_timings.post_number,
         round(exp(avg(ln(msecs)))) AS gmean
  FROM post_timings
  INNER JOIN posts AS p2
    ON p2.post_number = post_timings.post_number
    AND p2.topic_id = post_timings.topic_id
    AND p2.user_id <> post_timings.user_id
  GROUP BY post_timings.topic_id, post_timings.post_number
) AS x
WHERE x.topic_id = posts.topic_id
  AND x.post_number = posts.post_number
  AND (posts.avg_time <> (x.gmean / 1000)::int OR posts.avg_time IS NULL)

When enough posts are being read, too many UPDATE queries get executed and blocked, all processes are exhausted, and the forum becomes unresponsive.

In my experience, the issue here is the UPDATE ... SELECT pattern, where the UPDATE blocks during the (long) execution time of the SELECT. I’m not a real Postgres guru, but maybe this could be avoided by something like (pseudo Ruby code):

rows = DB.query("SELECT ...")   # run the expensive aggregation on its own
rows.each do |row|
  DB.exec("UPDATE ...")         # then apply each change with a short query
end

?


(Sam Saffron) #11

We can probably rewrite this to shift data into a temp table and then use the temp table to update. That would heavily reduce locking.
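The temp-table idea splits the job into two statements: a long read-only aggregation that writes into a temp table, then a short UPDATE joined against it, so row locks on posts are only held for the second step. A toy sketch of that pattern, with SQLite standing in for Postgres so it runs standalone; the table names follow the thread, but the data is made up and the self-read filter (`p2.user_id <> post_timings.user_id`) is omitted for brevity:

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (topic_id INT, post_number INT, avg_time INT);
CREATE TABLE post_timings (topic_id INT, post_number INT, user_id INT, msecs INT);
INSERT INTO posts VALUES (1, 1, NULL);
INSERT INTO post_timings VALUES (1, 1, 10, 1000), (1, 1, 11, 4000);
""")

class GMean:
    """SQLite aggregate mimicking round(exp(avg(ln(msecs)))) from the job's SQL."""
    def __init__(self):
        self.log_sum, self.n = 0.0, 0
    def step(self, msecs):
        self.log_sum += math.log(msecs)
        self.n += 1
    def finalize(self):
        return round(math.exp(self.log_sum / self.n))

conn.create_aggregate("gmean", 1, GMean)

# Step 1: the expensive aggregation runs as a pure read and lands in a
# temp table -- no row locks on posts while this grinds away.
conn.executescript("""
CREATE TEMP TABLE post_gmeans AS
SELECT topic_id, post_number, gmean(msecs) AS gmean
FROM post_timings
GROUP BY topic_id, post_number;
""")

# Step 2: a short UPDATE against the precomputed results, so locks on
# posts are only held briefly.
conn.execute("""
UPDATE posts
SET avg_time = (SELECT g.gmean / 1000 FROM post_gmeans g
                WHERE g.topic_id = posts.topic_id
                  AND g.post_number = posts.post_number)
WHERE EXISTS (SELECT 1 FROM post_gmeans g
              WHERE g.topic_id = posts.topic_id
                AND g.post_number = posts.post_number)
""")
conn.commit()
```

As Matt notes below the two steps are no longer one atomic statement, but for periodic stats that tradeoff is usually fine.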


(Matt Palmer) #12

Yes, it’s all down to the UPDATE ... SELECT pattern – that query is going to be locking a lot of rows as it goes along (probably not all rows, but enough to cause the problems you describe). Normally, I’m not a fan of separate SELECT and UPDATE queries, because it isn’t transactional (or, if you do wrap it in a transaction and use SELECT FOR UPDATE, it’s just as locky and now slower to run because it’s now two queries rather than one), but in this case, since we’re just doing some periodic stats mangling, it isn’t really such a big deal to do it in two steps.


(Rafael dos Santos Silva) #13

I believe the daily version of the query can be optimized by moving the filter up in the subquery.

Before: kmV3 : Discourse Post.calculate_avg_time daily | explain.depesz.com
After: i3JU : Discourse Post.calculate_avg_time daily Optimized | explain.depesz.com

This would not apply to the weekly one, but as you stated the daily one is the culprit.

Can you try that and confirm the differences @michaeld ?


(Michael - DiscourseHosting.com) #14

Can you please post the optimized query? I’m too unfamiliar with the Postgres EXPLAIN output to reconstruct the query from it…


(Rafael dos Santos Silva) #15

LOL, sorry, I forgot the query :smile:

Before

SELECT *
FROM posts,
  (SELECT post_timings.topic_id,
          post_timings.post_number,
          round(exp(avg(ln(msecs)))) AS gmean
   FROM post_timings
   INNER JOIN posts AS p2 ON p2.post_number = post_timings.post_number
   AND p2.topic_id = post_timings.topic_id
   AND p2.user_id <> post_timings.user_id
   GROUP BY post_timings.topic_id,
            post_timings.post_number) AS x
WHERE x.topic_id = posts.topic_id
  AND x.post_number = posts.post_number
  AND (posts.avg_time <> (x.gmean / 1000)::int
       OR posts.avg_time IS NULL)
  AND posts.topic_id IN
    (SELECT id
     FROM topics
     WHERE bumped_at > CURRENT_DATE - 2);

After

SELECT *
FROM posts,
  (SELECT post_timings.topic_id,
          post_timings.post_number,
          round(exp(avg(ln(msecs)))) AS gmean
   FROM post_timings
   INNER JOIN posts AS p2 ON p2.post_number = post_timings.post_number
   AND p2.topic_id = post_timings.topic_id
   AND p2.user_id <> post_timings.user_id
   WHERE p2.topic_id IN
       (SELECT id
        FROM topics
        WHERE bumped_at > CURRENT_DATE - 2)
   GROUP BY post_timings.topic_id,
            post_timings.post_number) AS x
WHERE x.topic_id = posts.topic_id
  AND x.post_number = posts.post_number
  AND (posts.avg_time <> (x.gmean / 1000)::int
       OR posts.avg_time IS NULL);

I have changed from UPDATE to SELECT so I could play around with it.
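Pushing the recent-topics filter into the subquery shouldn't change the result set, only how early the irrelevant rows are discarded. A toy equivalence check with SQLite (schema trimmed, gmean replaced by plain AVG, the user-id filter dropped, and dates replaced by integer day numbers, all just to keep it runnable):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE topics (id INT, bumped_at INT);
CREATE TABLE posts (topic_id INT, post_number INT, avg_time INT);
CREATE TABLE post_timings (topic_id INT, post_number INT, user_id INT, msecs INT);
INSERT INTO topics VALUES (1, 99), (2, 101);      -- only topic 2 is "recent"
INSERT INTO posts VALUES (1, 1, NULL), (2, 1, NULL);
INSERT INTO post_timings VALUES (1, 1, 10, 500), (2, 1, 10, 800);
""")

subquery = """
SELECT post_timings.topic_id, post_timings.post_number, AVG(msecs) AS gmean
FROM post_timings
INNER JOIN posts AS p2 ON p2.post_number = post_timings.post_number
                      AND p2.topic_id = post_timings.topic_id
{filter}
GROUP BY post_timings.topic_id, post_timings.post_number
"""
recent = "(SELECT id FROM topics WHERE bumped_at > 100)"

# "Before": subquery aggregates everything, filter applied outside.
unfiltered = subquery.format(filter="")
before = f"""
SELECT posts.topic_id, posts.post_number, x.gmean
FROM posts, ({unfiltered}) AS x
WHERE x.topic_id = posts.topic_id AND x.post_number = posts.post_number
  AND posts.topic_id IN {recent}
"""

# "After": same filter pushed into the subquery, outer filter dropped.
pushed = subquery.format(filter="WHERE p2.topic_id IN " + recent)
after = f"""
SELECT posts.topic_id, posts.post_number, x.gmean
FROM posts, ({pushed}) AS x
WHERE x.topic_id = posts.topic_id AND x.post_number = posts.post_number
"""
```

On a real forum the difference is that the "after" form never aggregates the tens of millions of timing rows belonging to topics that weren't bumped recently.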


(Jeff Atwood) #16

Is this effective in your local testing with a large database?


(Rafael dos Santos Silva) #17

It made the query go from 22s to 1.4s, after caching.

The first run was much worse, like 210s.

This is affecting ~900 rows.

Also, this is on HDD. SSD may be way less sensitive to these big index scans.


(Sam Saffron) #18

I would much prefer to shuffle the data around; that way there is no locking involved during the update.


(Jeff Atwood) #19

Still, if this helps, why not give it a shot for now? We’re busy with other stuff.