Post.calculate_avg_time() taking a long time

I just caught another occurrence of this.

We have a number of blocked queries that look like this:

UPDATE "posts" SET reads = reads + 1 WHERE ("posts"."deleted_at" IS NULL) AND (topic_id = 47 and post_number = 587)

They’re blocking each other, and eventually they are all blocked by:

UPDATE posts
SET avg_time = (x.gmean / 1000)
FROM (SELECT post_timings.topic_id,
             post_timings.post_number,
             round(exp(avg(ln(msecs)))) AS gmean
      FROM post_timings
      INNER JOIN posts AS p2
        ON p2.post_number = post_timings.post_number
        AND p2.topic_id = post_timings.topic_id
        AND p2.user_id <> post_timings.user_id
      GROUP BY post_timings.topic_id, post_timings.post_number) AS x
WHERE (x.topic_id = posts.topic_id
  AND x.post_number = posts.post_number
  AND (posts.avg_time <> (x.gmean / 1000)::int OR posts.avg_time IS NULL))
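
(In case it isn’t obvious: round(exp(avg(ln(msecs)))) computes the geometric mean of the msecs values, since exp((ln x1 + ... + ln xn) / n) = (x1 * ... * xn)^(1/n) — hence the gmean alias.)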

When enough posts are being read, too many UPDATE queries are executed and blocked, all processes get exhausted, and the forum becomes unresponsive.

In my experience, the issue here is the UPDATE ... SELECT pattern, where the UPDATE holds locks for the (long) execution time of the SELECT. I’m not a real Postgres guru, but maybe this could be avoided by something like (pseudo-Ruby):

rows = DB.query("SELECT ...")  # run the slow aggregate first, holding no row locks
rows.each do |row|
  DB.exec("UPDATE ...")        # each UPDATE is short and locks only its own row
end

?

We can probably rewrite this to shift the data into a temp table and then use the temp table to update. That would heavily reduce locking.
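
Roughly like this (an untested sketch — tmp_post_avg_time is a made-up name, and the aggregate and filter are just copied from the query above):

CREATE TEMP TABLE tmp_post_avg_time AS
SELECT post_timings.topic_id,
       post_timings.post_number,
       round(exp(avg(ln(msecs)))) AS gmean
FROM post_timings
INNER JOIN posts AS p2
  ON p2.post_number = post_timings.post_number
  AND p2.topic_id = post_timings.topic_id
  AND p2.user_id <> post_timings.user_id
GROUP BY post_timings.topic_id, post_timings.post_number;

-- The slow aggregation above takes no row locks on posts; only this
-- short UPDATE does, and only on the rows it actually changes.
UPDATE posts
SET avg_time = (x.gmean / 1000)
FROM tmp_post_avg_time AS x
WHERE x.topic_id = posts.topic_id
  AND x.post_number = posts.post_number
  AND (posts.avg_time <> (x.gmean / 1000)::int OR posts.avg_time IS NULL);

DROP TABLE tmp_post_avg_time;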


Yes, it’s all down to the UPDATE ... SELECT pattern – that query is going to be locking a lot of rows as it goes along (probably not all rows, but enough to cause the problems you describe). Normally, I’m not a fan of separate SELECT and UPDATE queries, because it isn’t transactional (or, if you do wrap it in a transaction and use SELECT FOR UPDATE, it’s just as locky and now slower to run because it’s now two queries rather than one), but in this case, since we’re just doing some periodic stats mangling, it isn’t really such a big deal to do it in two steps.
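
To illustrate the trade-off with throwaway values (nothing here is the actual job, just the shape of the two patterns):

-- Plain two-step: no locks held between the statements, but not
-- transactional; another writer can change the rows in between.
SELECT topic_id, post_number FROM posts WHERE topic_id = 47;
UPDATE posts SET avg_time = 42 WHERE topic_id = 47 AND post_number = 587;

-- Transactional variant: correct, but just as locky, and now two round trips.
BEGIN;
SELECT post_number FROM posts WHERE topic_id = 47 FOR UPDATE;  -- row locks held until COMMIT
UPDATE posts SET avg_time = 42 WHERE topic_id = 47;
COMMIT;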


I believe the daily version of the query can be optimized by moving the topic filter up into the subquery.

Before: kmV3 : Discourse Post.calculate_avg_time daily | explain.depesz.com
After: i3JU : Discourse Post.calculate_avg_time daily Optimized | explain.depesz.com

This would not apply to the weekly one but, as you stated, the daily one is the culprit.

Can you try that and confirm the difference, @michaeld?


Can you please post the optimized query? I’m too unfamiliar with the PostgreSQL EXPLAIN output to reconstruct the query from it…

LOL, sorry, I forgot the query :smile:

Before

SELECT *
FROM posts ,

  (SELECT post_timings.topic_id,
          post_timings.post_number,
          round(exp(avg(ln(msecs)))) AS gmean
   FROM post_timings
   INNER JOIN posts AS p2 ON p2.post_number = post_timings.post_number
   AND p2.topic_id = post_timings.topic_id
   AND p2.user_id <> post_timings.user_id
   GROUP BY post_timings.topic_id,
            post_timings.post_number) AS x
WHERE x.topic_id = posts.topic_id
  AND x.post_number = posts.post_number
  AND (posts.avg_time <> (x.gmean / 1000)::int
       OR posts.avg_time IS NULL)
  AND posts.topic_id IN
    (SELECT id
     FROM topics
     WHERE bumped_at > CURRENT_DATE - 2);

After

SELECT *
FROM posts ,
  (SELECT post_timings.topic_id,
          post_timings.post_number,
          round(exp(avg(ln(msecs)))) AS gmean
   FROM post_timings
   INNER JOIN posts AS p2 ON p2.post_number = post_timings.post_number
   AND p2.topic_id = post_timings.topic_id
   AND p2.user_id <> post_timings.user_id
   WHERE p2.topic_id IN
       (SELECT id
        FROM topics
        WHERE bumped_at > CURRENT_DATE - 2)
   GROUP BY post_timings.topic_id,
            post_timings.post_number) AS x
WHERE x.topic_id = posts.topic_id
  AND x.post_number = posts.post_number
  AND (posts.avg_time <> (x.gmean / 1000)::int
       OR posts.avg_time IS NULL);

I changed the UPDATE to a SELECT so I could play around with it.
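
For reference, the equivalent UPDATE form (the same query with the original SET avg_time clause restored) would look like:

UPDATE posts
SET avg_time = (x.gmean / 1000)
FROM
  (SELECT post_timings.topic_id,
          post_timings.post_number,
          round(exp(avg(ln(msecs)))) AS gmean
   FROM post_timings
   INNER JOIN posts AS p2 ON p2.post_number = post_timings.post_number
   AND p2.topic_id = post_timings.topic_id
   AND p2.user_id <> post_timings.user_id
   WHERE p2.topic_id IN
       (SELECT id
        FROM topics
        WHERE bumped_at > CURRENT_DATE - 2)
   GROUP BY post_timings.topic_id,
            post_timings.post_number) AS x
WHERE x.topic_id = posts.topic_id
  AND x.post_number = posts.post_number
  AND (posts.avg_time <> (x.gmean / 1000)::int
       OR posts.avg_time IS NULL);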


Is this effective in your local testing with a large database?

It made the query go from 22s to 1.4s, after caching.

The first run was much worse, around 210s.

This affects ~900 rows.

Also, this is on HDD; SSD may be far less sensitive to these big index scans.


I would much prefer to shuffle the data around; that way there is no locking involved during the update.

Still, if this helps, why not give it a shot for now? We’re busy with other stuff.

@Falco, I don’t want this lost forever. Can you do a quick PR to shift the WHERE clause up? I think this is safe enough and will not do any damage as far as I can tell.


The SQL changes suggested here are live now.


Looks like it went from ~85s to 35s here on Meta. In another instance (which runs on our beefier bare-metal hosting) it went from 1350ms to 350ms.



Is it possible that these changes cause a performance issue?
Our instance is running slowly, and I had to cancel a query to get Discourse running again (but it still seems slow).
The query is:

(pid 172, running for 02:07:54.107363, state active)

UPDATE posts
SET avg_time = (x.gmean / 1000)
FROM (SELECT post_timings.topic_id,
             post_timings.post_number,
             round(exp(avg(CASE WHEN msecs > 0 THEN ln(msecs) ELSE 0 END))) AS gmean
      FROM post_timings
      INNER JOIN posts AS p2
        ON p2.post_number = post_timings.post_number
        AND p2.topic_id = post_timings.topic_id
        AND p2.user_id <> post_timings.user_id
      WHERE (p2.topic_id IN (SELECT id FROM topics WHERE bumped_at > '2019-05-03 09:39:22.717153'))
      GROUP BY post_timings.topic_id, post_timings.post_number) AS x
WHERE (x.topic_id = posts.topic_id
  AND x.post_number = posts.post_number
  AND (posts.avg_time <> (x.gmean / 1000)::int OR posts.avg_time IS NULL))

We have a huge database with lots of topics containing 10k+ messages. We are trying to split them but haven’t managed to yet.


My changes affected only the daily query, which operates over posts from the last two days.


I can see the date in the WHERE clause of the query. But since our last Discourse update we have had performance issues; that query had been running for two hours when I cancelled it.
Don’t you think it’s related? Should I create a new topic in support?

I will be removing this whole job in the interim; we are getting very little value out of this data and need to rethink this subsystem.

EDIT:

done per

Closing this as this topic is no longer applicable.
