Post.calculate_avg_time() taking a long time


(Michael - DiscourseHosting.com) #1

On forums with a lot of posts, we see the daily CalculateAvgTime job taking a lot of time and even blocking other processes.

The source code comments already say

# PERF: these calculations can become exceedingly expensive
#  they run a huge geometric mean and are hard to optimise
#  defer to only run once a day

But they still don’t play nice.
Would it hurt if we cleaned up the post_timings table on a regular basis? Or are we missing something obvious here, and is there a better way to solve this?


(Jeff Atwood) #2

A question for @sam. I think all the recent optimizations for this landed in 1.6 though.


(Sam Saffron) #3

AFAIK this gets faster after it finishes the first run. Has a complete run ever finished?


(Michael - DiscourseHosting.com) #4

Yes, it runs every day without errors, it just takes a long time… On a large forum with pretty decent hardware it takes 2-4 minutes… I don’t see it getting faster (how would that work? Postgres doesn’t optimize *that* well for a single query?)


(Michael - DiscourseHosting.com) #5

BTW the issue is not just that it takes a while, but that it tends to block other processes…


(Jeff Atwood) #6

How large? How many posts? We’ve seen virtualized cloud database hardware take 3x longer than our bare metal database hardware.


(Michael - DiscourseHosting.com) #7

I’m on my phone now, will look up the numbers later. I estimate 1M posts and 80M post_timings.

EDIT my memory served me right… 1.24M posts and 85.8M post_timings

But to be honest, I don’t think 3x matters that much. This looks like an O(n*m) query… it will always cause trouble eventually… I’m more looking for a way to make it O(1), or to stop it from blocking the posts table.


(Matt Palmer) #8

When you say it’s blocking the posts table, which other queries, exactly, is it blocking? I’m not particularly familiar with the query under discussion, but depending on which queries are being blocked, we might be able to make it less intrusive (by not taking write locks, or checkpointing, or something).


(Mittineague) #9

I too am unfamiliar with this. My dev instances have a low number of posts so this problem has not surfaced.

If the code is determining geometric mean values for every topic I can see how this could get resource intensive.

Maybe it would help to include only those topics that have had posts edited or added since the last calculation?
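For reference, the per-post statistic the job computes is a geometric mean of read times, `round(exp(avg(ln(msecs))))` in the job's SQL. A minimal Python illustration of that formula, with made-up timings:

```python
import math

def geometric_mean_ms(msecs):
    """exp(mean(ln(x))): same shape as round(exp(avg(ln(msecs)))) in the SQL."""
    return round(math.exp(sum(math.log(m) for m in msecs) / len(msecs)))

# Two readers spent 1000 ms and 4000 ms on a post:
geometric_mean_ms([1000, 4000])  # 2000, vs. an arithmetic mean of 2500
```

The geometric mean damps outliers (someone leaving a tab open for an hour), which is presumably why it was chosen over a plain average, but it forces the database to touch every timing row for every post it aggregates.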


(Michael - DiscourseHosting.com) #10

I just caught another occurrence of this.

We have a number of queries being blocked that look like

UPDATE "posts" SET reads = reads + 1 WHERE ("posts"."deleted_at" IS NULL) AND (topic_id = 47 and post_number = 587)

They’re blocking each other, and eventually they are blocked by

UPDATE posts
SET avg_time = (x.gmean / 1000)
FROM (
  SELECT post_timings.topic_id,
         post_timings.post_number,
         round(exp(avg(ln(msecs)))) AS gmean
  FROM post_timings
  INNER JOIN posts AS p2
    ON p2.post_number = post_timings.post_number
    AND p2.topic_id = post_timings.topic_id
    AND p2.user_id <> post_timings.user_id
  GROUP BY post_timings.topic_id, post_timings.post_number
) AS x
WHERE x.topic_id = posts.topic_id
  AND x.post_number = posts.post_number
  AND (posts.avg_time <> (x.gmean / 1000)::int OR posts.avg_time IS NULL)

When enough posts are being read, too many UPDATE queries get executed and blocked, all processes are exhausted, and the forum becomes unresponsive.

In my experience, the issue here is the UPDATE ... SELECT pattern, where the UPDATE blocks during the (long) execution time of the SELECT. I’m not a real Postgres guru, but maybe this could be avoided by something like (pseudo Ruby code):

rows = DB.query("SELECT ...")   # run the expensive aggregation on its own
rows.each do |row|
  DB.exec("UPDATE ...")         # then apply each change with a short query
end

?


(Sam Saffron) #11

We can probably rewrite this to shift data into a temp table and then use the temp table to update. That would heavily reduce locking.
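The temp-table idea splits the job into two statements: a long read-only aggregation that writes into a temp table, then a short UPDATE joined against it, so row locks on posts are only held for the second step. A toy sketch of that pattern, with SQLite standing in for Postgres so it runs standalone; the table names follow the thread, but the data is made up and the self-read filter (`p2.user_id <> post_timings.user_id`) is omitted for brevity:

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (topic_id INT, post_number INT, avg_time INT);
CREATE TABLE post_timings (topic_id INT, post_number INT, user_id INT, msecs INT);
INSERT INTO posts VALUES (1, 1, NULL);
INSERT INTO post_timings VALUES (1, 1, 10, 1000), (1, 1, 11, 4000);
""")

class GMean:
    """SQLite aggregate mimicking round(exp(avg(ln(msecs)))) from the job's SQL."""
    def __init__(self):
        self.log_sum, self.n = 0.0, 0
    def step(self, msecs):
        self.log_sum += math.log(msecs)
        self.n += 1
    def finalize(self):
        return round(math.exp(self.log_sum / self.n))

conn.create_aggregate("gmean", 1, GMean)

# Step 1: the expensive aggregation runs as a pure read and lands in a
# temp table -- no row locks on posts while this grinds away.
conn.executescript("""
CREATE TEMP TABLE post_gmeans AS
SELECT topic_id, post_number, gmean(msecs) AS gmean
FROM post_timings
GROUP BY topic_id, post_number;
""")

# Step 2: a short UPDATE against the precomputed results, so locks on
# posts are only held briefly.
conn.execute("""
UPDATE posts
SET avg_time = (SELECT g.gmean / 1000 FROM post_gmeans g
                WHERE g.topic_id = posts.topic_id
                  AND g.post_number = posts.post_number)
WHERE EXISTS (SELECT 1 FROM post_gmeans g
              WHERE g.topic_id = posts.topic_id
                AND g.post_number = posts.post_number)
""")
conn.commit()
```

As Matt notes below the two steps are no longer one atomic statement, but for periodic stats that tradeoff is usually fine.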


(Matt Palmer) #12

Yes, it’s all down to the UPDATE ... SELECT pattern – that query is going to be locking a lot of rows as it goes along (probably not all rows, but enough to cause the problems you describe). Normally, I’m not a fan of separate SELECT and UPDATE queries, because it isn’t transactional (or, if you do wrap it in a transaction and use SELECT FOR UPDATE, it’s just as locky and now slower to run because it’s now two queries rather than one), but in this case, since we’re just doing some periodic stats mangling, it isn’t really such a big deal to do it in two steps.


(Rafael dos Santos Silva) #13

I believe the daily version of the query can be optimized by moving the filter up in the subquery.

Before: kmV3 : Discourse Post.calculate_avg_time daily | explain.depesz.com
After: i3JU : Discourse Post.calculate_avg_time daily Optimized | explain.depesz.com

This would not apply to the weekly one, but as you stated the daily one is the culprit.

Can you try that and confirm the differences @michaeld ?


(Michael - DiscourseHosting.com) #14

Can you please post the optimized query? I’m too unfamiliar with the Postgres EXPLAIN output to reconstruct the query from it…


(Rafael dos Santos Silva) #15

LOL, sorry, I forgot the query :smile:

Before

SELECT *
FROM posts,
  (SELECT post_timings.topic_id,
          post_timings.post_number,
          round(exp(avg(ln(msecs)))) AS gmean
   FROM post_timings
   INNER JOIN posts AS p2 ON p2.post_number = post_timings.post_number
   AND p2.topic_id = post_timings.topic_id
   AND p2.user_id <> post_timings.user_id
   GROUP BY post_timings.topic_id,
            post_timings.post_number) AS x
WHERE x.topic_id = posts.topic_id
  AND x.post_number = posts.post_number
  AND (posts.avg_time <> (x.gmean / 1000)::int
       OR posts.avg_time IS NULL)
  AND posts.topic_id IN
    (SELECT id
     FROM topics
     WHERE bumped_at > CURRENT_DATE - 2);

After

SELECT *
FROM posts,
  (SELECT post_timings.topic_id,
          post_timings.post_number,
          round(exp(avg(ln(msecs)))) AS gmean
   FROM post_timings
   INNER JOIN posts AS p2 ON p2.post_number = post_timings.post_number
   AND p2.topic_id = post_timings.topic_id
   AND p2.user_id <> post_timings.user_id
   WHERE p2.topic_id IN
       (SELECT id
        FROM topics
        WHERE bumped_at > CURRENT_DATE - 2)
   GROUP BY post_timings.topic_id,
            post_timings.post_number) AS x
WHERE x.topic_id = posts.topic_id
  AND x.post_number = posts.post_number
  AND (posts.avg_time <> (x.gmean / 1000)::int
       OR posts.avg_time IS NULL);

I have changed from UPDATE to SELECT so I could play around with it.
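Pushing the recent-topics filter into the subquery shouldn't change the result set, only how early the irrelevant rows are discarded. A toy equivalence check with SQLite (schema trimmed, gmean replaced by plain AVG, the user-id filter dropped, and dates replaced by integer day numbers, all just to keep it runnable):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE topics (id INT, bumped_at INT);
CREATE TABLE posts (topic_id INT, post_number INT, avg_time INT);
CREATE TABLE post_timings (topic_id INT, post_number INT, user_id INT, msecs INT);
INSERT INTO topics VALUES (1, 99), (2, 101);      -- only topic 2 is "recent"
INSERT INTO posts VALUES (1, 1, NULL), (2, 1, NULL);
INSERT INTO post_timings VALUES (1, 1, 10, 500), (2, 1, 10, 800);
""")

subquery = """
SELECT post_timings.topic_id, post_timings.post_number, AVG(msecs) AS gmean
FROM post_timings
INNER JOIN posts AS p2 ON p2.post_number = post_timings.post_number
                      AND p2.topic_id = post_timings.topic_id
{filter}
GROUP BY post_timings.topic_id, post_timings.post_number
"""
recent = "(SELECT id FROM topics WHERE bumped_at > 100)"

# "Before": subquery aggregates everything, filter applied outside.
unfiltered = subquery.format(filter="")
before = f"""
SELECT posts.topic_id, posts.post_number, x.gmean
FROM posts, ({unfiltered}) AS x
WHERE x.topic_id = posts.topic_id AND x.post_number = posts.post_number
  AND posts.topic_id IN {recent}
"""

# "After": same filter pushed into the subquery, outer filter dropped.
pushed = subquery.format(filter="WHERE p2.topic_id IN " + recent)
after = f"""
SELECT posts.topic_id, posts.post_number, x.gmean
FROM posts, ({pushed}) AS x
WHERE x.topic_id = posts.topic_id AND x.post_number = posts.post_number
"""
```

On a real forum the difference is that the "after" form never aggregates the tens of millions of timing rows belonging to topics that weren't bumped recently.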


(Jeff Atwood) #16

Is this effective in your local testing with a large database?


(Rafael dos Santos Silva) #17

It made the query go from 22s to 1.4s, after caching.

The first run was much worse, like 210s.

This is affecting ~900 rows.

Also, this is on HDD. SSD may be way less sensitive to these big index scans.


(Sam Saffron) #18

I would much prefer to shuffle the data around; that way there is no locking involved during the update.


(Jeff Atwood) #19

Still, if this helps, why not give it a shot for now? We’re busy with other stuff.