PostgreSQL runaway IO

Surprising that you are running into this with 17k posts in one topic, I view 20k as the “starting to be a problem” cutoff. But maybe we should set the bar lower cc @tgxworld

It could be related to my specific setup though - if it helps i can document a bit more, run some additional tests etc (bonnie++ or so).

We have some public results about disk bench here:

Maybe run those (when the forum has low load) to compare?

Just as a thought exercise: what exactly would break if this specific job would not exist or run?

I can’t find any place in the code where Post.avg_time and Topic.avg_time are actually being used? Or is there maybe some generated accessor that I’m missing?

1 Like

It’s used to compute the post score that we use to rank posts so we can summarize them.


I have mentioned this particular issue before… it is totally unrelated to @tgxworld’s work.

I am still not buying that we are getting any value from “geometric mean of all read times for all posts”… a simple… which posts are read most is probably good enough for the “best of” algorithm.

Plus this whole concept is a bit broken imo cause the further you get to the end of a topic the less people read it.

I think a simple fix here, is to stop doing the whole geometric mean read time thing on topics with more than say 1000 posts. Which is oddly the only time we would use it… :frowning:

Silly idea perhaps: Topics have a certain flow, setup, experience to them. Some topics are by nature filled with short posts, others with long posts.

Would it be an idea to calculate the mean over the first 500 or 1000 posts and use that as an indication for all posts in the thread?

After all, it’s an indication so an approximation would be acceptable, no?

I don’t even know if this would help in our case. Fact is that post_timings is bigger than the internal memory of the server (20 GB vs 16 GB) and even limiting to the first x posts would still result in the database failing to pull the entire table into memory.

So my feeling is that attempting to do this query differently would work better than trying to “limit” it.


Hi, I did run a test just now on the server (still relatively active with 30 active users online). These are the results:

19.3 k requests completed in 9.89 s, 75.5 MiB read, 1.95 k iops, 7.63 MiB/s
generated 19.3 k requests in 10.0 s, 75.5 MiB, 1.93 k iops, 7.55 MiB/s

Everything seems to be about 1/7th of the speed you get. Perhaps indeed I should consider switching from old-school HDD to SSDs.


These numbers are terrifyingly bad even for traditional HDDs. Way below what fancy enterprise 10k and 15k drives should be doing.


I’ve seen this before (on the linux-raid mailing list) when legacy partition tables and contents are copied onto modern 4k sector hard drives. (Default legacy partitions are not 4k -aligned.) The 512-byte sector support will generate read-modify-write cycles in the drive firmware as needed, and random access workloads on misaligned blocks will need lots of R-M-W.

On linux-raid, these discoveries are usually associated with raid array rebuild failures and catastrophic data loss. The poor performance is a hint that all is not alright with your storage stack.


You’re right (although in a slightly different way).

First of all, I am an idiot. Second of all, thank you all for the hints and suggestions.

Third and most importantly: I have a new-found hate for raid controllers which disable disk caches.

Enabled the disk caches, got the proper speeds.

Let’s see how big the real-world gains are though.



Sorry to open up this topic again but I have a question on the size of post_timings.

Out of curiousity I had a look at the database to see how big that table really is, so I checked it with:
select count(*) from post_timings;
and that came back with 25 million records, 25.101.376, to be specific.

The total number of posts in the system is 99.260.

Does it make sense that post_timings contains roughly 250 entries per post? Or is there something that went (a bit) wrong somewhere and is there stuff I can do to clean up?


post_timings will contain 1 entry for every user that reads a post. So if you have, on average, 250 users reading each post, then 25 million rows sounds about right.


Hi! Sorry to trouble you again over this but I have -yet again- a related question.

Would it perhaps be possible to calculate the relevant value for a specific topic once the topic is being archived (and then remove all those entries from the post_timings table?).

Some context:
I’ve got a forum here that’s been up for 2.5 months now, and we’re over 100K posts already, with close to 1400 (very active!) users.

Some of these topics tend to be revisited by very many of those users, so I’m afraid the system will eventually come to a grinding halt if i’m not careful.

Some data: 25M records in the post_timings table. The top-10 topics (out of little over 400 topics) are good for 16.7M entries.

What we’re doing now is close topics when they reach 5.000 posts (the most active topic reaches this after around 10 days). And then we open a new topic. As you can imagine, these topics are used for very active social conversations and sometimes more resemble a chat than a forum.

When we close a topic the users move on to the next one (and hardly ever come back to the previous one - it’s basically an archive from then on). So we archive the old topic. I would happily accept it if we were to make one final calculation of the average reading time of the archived topic (so that the other functionality doesn’t break), then remove all entries for the archived topic bar one (which contains the final calculated value) and thereby significantly reduce the size of the table.

We could even do the same thing but per user if needed.

In fact: If the most important reasons for this table are to be able to calculate how much time a specific user has spent reading then perhaps we could make it a configurable option? I know that funnily enough our users are not really interested in that (in fact, it’s a bit of a shocker for some that they spend so much time on the forum!)

Would this work?

Are you actually seeing any issues caused by the number of rows in your post_timings table? Pruning the table might be possible, but would likely have unintended side effects.


Well, I am still seeing issues with the CalculateAvgTime query which doesn’t seem to want to finish. I first thought it was related to reduced performance of my underlying IO storage, but indeed - I moved it to an SSD-based test environment and it still seems to not want to finish.

I have to admit, I’m just trying to see what I can do to improve life here, so any and all advice is welcome. Then again: if this is a case of the ‘sorry, can’t fix’ then I’ll accept that as well.

Btw: for the administrative side of things, would it be possible to setup a read-only slave which does the bulk of the processing for all maintenance bits? Because I could then move that to its own working environment, and not have it impact performance of the main forum.

You absolutely need to prevent megatopics from forming. Any topic approaching 10k replies is an accident waiting to happen. The more you have of these, the worse it will get, and the more replies, the worse the overall effect.

1 Like

We indeed learned from this and will close topics when they grow to 5000 posts.
Question is: what can I do with the one archived 17.0000-post topic?

Can I break it into smaller pieces? Does it help when I delete the associated entries from then post_timings table?

Or should I just throw away that topic altogether?

Edit: I think the size of the post_timings table now also starts to affect the BadgeGrant query which seems to be not too happy.

Update: I’ve taken the plunge and moved a lot of entries into new topics. The biggest topic is now on longer 17K posts, but 7K.

Still, no dice.

The CalculateAvgTime and BadgeGrant now both seem to keep on running forever.

a select count(*) from post_timings; tells me that post_timings is now 29 million records large, and I think that has to do with at least some of the problems.

From the BadgeGrant query I ran the innermost bit by hand:

SELECT pt.user_id                                                         
     FROM post_timings pt                                                      
     JOIN badge_posts b ON b.post_number = pt.post_number AND                  
                           b.topic_id = pt.topic_id                            
     JOIN topics t ON = pt.topic_id                                       
     LEFT JOIN user_badges ub ON ub.badge_id = 17 AND ub.user_id = pt.user_id  
     WHERE IS NULL AND t.posts_count > 100                               
     GROUP BY pt.user_id, pt.topic_id, t.posts_count                           
     HAVING count(*) >= t.posts_count

and indeed I get the same results - taking forever. And with forever I mean I let it run for at least 20 minutes before I interrupted this query.

My biggest concern is that this will only worsen due to the fact that this is a rather active forum and we’ve only been underway for 2.5 months.

Is there any suggestion you have that could alleviate this situation?