On forums with a lot of posts, we see the daily CalculateAvgTime job taking a lot of time and even blocking other processes.
The source code comments already say:
# PERF: these calculations can become exceedingly expensive
# they run a huge geometric mean and are hard to optimise
# defer to only run once a day
But they still don’t play nice.
Would it hurt if we cleaned up the post_timings table on a regular basis? Or are we missing something obvious here, and is there a better way to solve this?
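For concreteness, the kind of cleanup I have in mind would be something along these lines (just a sketch, the one-year cutoff is arbitrary): throw away timings for topics that haven't been bumped in a long time.
DELETE FROM post_timings
USING topics
WHERE topics.id = post_timings.topic_id
  AND topics.bumped_at < CURRENT_DATE - 365;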
Yes, it runs every day without errors, it just takes a long time… On a large forum with pretty decent hardware it takes 2-4 minutes… I don’t see it getting faster by itself (how would that work? Postgres doesn’t optimize a single query that well, does it?)
I’m on my phone now, will look up the numbers later. I estimate 1M posts and 80M post_timings.
EDIT my memory served me right… 1.24M posts and 85.8M post_timings
But to be honest, I don’t think 3x matters that much. This looks like an O(n*m) query… it will always cause trouble eventually… I’m looking more for a way to make it O(1), or at least to stop it from blocking the posts table.
When you say it’s blocking the posts table, which other queries, exactly, is it blocking? I’m not particularly familiar with the query under discussion, but depending on which queries are being blocked, we might be able to make it less intrusive (by not taking write locks, or checkpointing, or something).
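If it’s not obvious from the logs which sessions are waiting on which, a query along these lines against pg_stat_activity will show the blocking chain (assuming Postgres 9.6+, which added pg_blocking_pids):
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid));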
We have a number of queries being blocked that look like
UPDATE "posts" SET reads = reads + 1 WHERE ("posts"."deleted_at" IS NULL) AND (topic_id = 47 and post_number = 587)
They’re blocking each other, and eventually they are blocked by
UPDATE posts
SET avg_time = (x.gmean / 1000)
FROM
(
  SELECT post_timings.topic_id,
         post_timings.post_number,
         round(exp(avg(ln(msecs)))) AS gmean
  FROM post_timings
  INNER JOIN posts AS p2
          ON p2.post_number = post_timings.post_number
         AND p2.topic_id = post_timings.topic_id
         AND p2.user_id <> post_timings.user_id
  GROUP BY post_timings.topic_id, post_timings.post_number
) AS x
WHERE (x.topic_id = posts.topic_id
       AND x.post_number = posts.post_number
       AND (posts.avg_time <> (x.gmean / 1000)::int OR posts.avg_time IS NULL))
When there are enough posts being read, too many of these UPDATE queries get executed and blocked, all the available processes get tied up, and the forum becomes unresponsive.
In my experience, the issue here is the UPDATE ... SELECT pattern, where the UPDATE keeps rows locked for the (long) execution time of the SELECT. I’m not a real Postgres guru, but maybe this could be avoided with something like this (pseudo Ruby, where execute stands for whatever DB helper you have):
rows = execute("SELECT ...")       # run the expensive aggregation on its own first
rows.each do |row|
  execute("UPDATE ... WHERE ...")  # then a short UPDATE per row
end
Yes, it’s all down to the UPDATE ... SELECT pattern – that query is going to be locking a lot of rows as it goes along (probably not all of them, but enough to cause the problems you describe). Normally I’m not a fan of separate SELECT and UPDATE queries, because it isn’t transactional (or, if you do wrap it in a transaction and use SELECT FOR UPDATE, it’s just as locky and slower to run, because it’s now two queries rather than one), but in this case, since we’re just doing some periodic stats mangling, doing it in two steps isn’t really a big deal.
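A rough sketch of that two-step approach in plain SQL (just an illustration; tmp_avg_time is a made-up name, and this assumes the job is free to run a couple of statements instead of one):
-- Materialise the expensive aggregate first; this only takes read locks on
-- posts/post_timings, so it shouldn't block the reads = reads + 1 updates.
CREATE TEMP TABLE tmp_avg_time AS
SELECT post_timings.topic_id,
       post_timings.post_number,
       round(exp(avg(ln(msecs)))) AS gmean
FROM post_timings
INNER JOIN posts AS p2
        ON p2.post_number = post_timings.post_number
       AND p2.topic_id = post_timings.topic_id
       AND p2.user_id <> post_timings.user_id
GROUP BY post_timings.topic_id, post_timings.post_number;

-- Then a short UPDATE against the pre-computed values; it only locks the
-- rows whose avg_time actually changes, and only for as long as it runs.
UPDATE posts
SET avg_time = (x.gmean / 1000)
FROM tmp_avg_time AS x
WHERE x.topic_id = posts.topic_id
  AND x.post_number = posts.post_number
  AND (posts.avg_time <> (x.gmean / 1000)::int OR posts.avg_time IS NULL);
It’s still not transactional across the two statements, but as you say, for a periodic stats job that’s not a big deal.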
Before:
SELECT *
FROM posts,
     (SELECT post_timings.topic_id,
             post_timings.post_number,
             round(exp(avg(ln(msecs)))) AS gmean
      FROM post_timings
      INNER JOIN posts AS p2 ON p2.post_number = post_timings.post_number
                            AND p2.topic_id = post_timings.topic_id
                            AND p2.user_id <> post_timings.user_id
      GROUP BY post_timings.topic_id,
               post_timings.post_number) AS x
WHERE x.topic_id = posts.topic_id
  AND x.post_number = posts.post_number
  AND (posts.avg_time <> (x.gmean / 1000)::int
       OR posts.avg_time IS NULL)
  AND posts.topic_id IN
      (SELECT id
       FROM topics
       WHERE bumped_at > CURRENT_DATE - 2);
After:
SELECT *
FROM posts,
     (SELECT post_timings.topic_id,
             post_timings.post_number,
             round(exp(avg(ln(msecs)))) AS gmean
      FROM post_timings
      INNER JOIN posts AS p2 ON p2.post_number = post_timings.post_number
                            AND p2.topic_id = post_timings.topic_id
                            AND p2.user_id <> post_timings.user_id
      WHERE p2.topic_id IN
            (SELECT id
             FROM topics
             WHERE bumped_at > CURRENT_DATE - 2)
      GROUP BY post_timings.topic_id,
               post_timings.post_number) AS x
WHERE x.topic_id = posts.topic_id
  AND x.post_number = posts.post_number
  AND (posts.avg_time <> (x.gmean / 1000)::int
       OR posts.avg_time IS NULL);
In both versions I changed the UPDATE into a SELECT so I could play around with it.
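In case it’s useful while playing with it: running just the inner aggregate under EXPLAIN (ANALYZE, BUFFERS) shows how many post_timings rows each variant actually has to touch, e.g. for the version with the filter pushed down:
EXPLAIN (ANALYZE, BUFFERS)
SELECT post_timings.topic_id,
       post_timings.post_number,
       round(exp(avg(ln(msecs)))) AS gmean
FROM post_timings
INNER JOIN posts AS p2 ON p2.post_number = post_timings.post_number
                      AND p2.topic_id = post_timings.topic_id
                      AND p2.user_id <> post_timings.user_id
WHERE p2.topic_id IN
      (SELECT id
       FROM topics
       WHERE bumped_at > CURRENT_DATE - 2)
GROUP BY post_timings.topic_id,
         post_timings.post_number;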