PostgreSQL runaway IO

They are SAS disks, 900GB 15K (spinning disks, not SSDs) configured in RAID10. They are not exactly slow, but you obviously wouldn’t get the IOPS you’d get from an SSD.

As for the database: the postgres_data directory is 4.5GB.

I doubt it’s redis in this case though - I stupidly enough forgot to grab a full screenshot from iotop but I can’t remember seeing redis in there. Did see postgres UPDATE at the top of the list with IO activity of 98% and up.

BTW: I’ll check tomorrow as I have no doubt it will happen again. And I’ll grab a screenshot from iotop.

I still don’t feel redis should be constantly clobbering the disk. It’s supposed to be an in memory key-value store most of all.

Hi, server’s very busy again - changed the timezone within the container from UTC to GMT+1 (Europe/Amsterdam) and indeed, load now starts 2 hours later.

Screenshot from iotop -o (which only lists the actual tasks doing I/O):

I’ve observed the behaviour and indeed, over the last 5 minutes, I did see Redis pop up twice, but only for a moment. Doesn’t come across as being the source of the problems to me.

I have a feeling there are some scheduled tasks which are being started, but I can’t really find the source. Checked cron jobs from both within the container and the underlying OS - no dice.

If you have any suggestions then those would be more than welcome! Also, if there’s more I can do regarding information gathering, please tell me! :slight_smile:

1 Like

More likely Postgres then, if you have ruled Redis out.

I think so too - see the screenshot of the output of iostat - it’s a PostgreSQL UPDATE command that generates most of the IO load.

Thing is: what is triggered here? And why? Where can I find how (and when!) the postgresql maintenance tasks are triggered?

Edit: Additional info: Screenshot from sidekiq:

Looks like your problem is this one: Reschedule sidekiq scheduled job

You have one still active topic (last reply 19m ago) with 17k posts. That plus slow disks kill performance. The fact that your data is larger than the RAM also makes this more complicated.


That’s the most popular topic indeed. Would it be better to break it up in pieces, close this one and open a new one?

Thanks for the pointer btw - I’ll have a look!

Yes. I set auto close topics post count to 1024 or 512 when I had an HDD instance.

Also, I completely forgot, but looks like I took an stab at this query 2 years ago:


Surprising that you are running into this with 17k posts in one topic, I view 20k as the “starting to be a problem” cutoff. But maybe we should set the bar lower cc @tgxworld

It could be related to my specific setup though - if it helps i can document a bit more, run some additional tests etc (bonnie++ or so).

We have some public results about disk bench here:

Maybe run those (when the forum has low load) to compare?

Just as a thought exercise: what exactly would break if this specific job would not exist or run?

I can’t find any place in the code where Post.avg_time and Topic.avg_time are actually being used? Or is there maybe some generated accessor that I’m missing?

1 Like

It’s used to compute the post score that we use to rank posts so we can summarize them.


I have mentioned this particular issue before… it is totally unrelated to @tgxworld’s work.

I am still not buying that we are getting any value from “geometric mean of all read times for all posts”… a simple… which posts are read most is probably good enough for the “best of” algorithm.

Plus this whole concept is a bit broken imo cause the further you get to the end of a topic the less people read it.

I think a simple fix here, is to stop doing the whole geometric mean read time thing on topics with more than say 1000 posts. Which is oddly the only time we would use it… :frowning:

Silly idea perhaps: Topics have a certain flow, setup, experience to them. Some topics are by nature filled with short posts, others with long posts.

Would it be an idea to calculate the mean over the first 500 or 1000 posts and use that as an indication for all posts in the thread?

After all, it’s an indication so an approximation would be acceptable, no?

I don’t even know if this would help in our case. Fact is that post_timings is bigger than the internal memory of the server (20 GB vs 16 GB) and even limiting to the first x posts would still result in the database failing to pull the entire table into memory.

So my feeling is that attempting to do this query differently would work better than trying to “limit” it.


Hi, I did run a test just now on the server (still relatively active with 30 active users online). These are the results:

19.3 k requests completed in 9.89 s, 75.5 MiB read, 1.95 k iops, 7.63 MiB/s
generated 19.3 k requests in 10.0 s, 75.5 MiB, 1.93 k iops, 7.55 MiB/s

Everything seems to be about 1/7th of the speed you get. Perhaps indeed I should consider switching from old-school HDD to SSDs.


These numbers are terrifyingly bad even for traditional HDDs. Way below what fancy enterprise 10k and 15k drives should be doing.


I’ve seen this before (on the linux-raid mailing list) when legacy partition tables and contents are copied onto modern 4k sector hard drives. (Default legacy partitions are not 4k -aligned.) The 512-byte sector support will generate read-modify-write cycles in the drive firmware as needed, and random access workloads on misaligned blocks will need lots of R-M-W.

On linux-raid, these discoveries are usually associated with raid array rebuild failures and catastrophic data loss. The poor performance is a hint that all is not alright with your storage stack.


You’re right (although in a slightly different way).

First of all, I am an idiot. Second of all, thank you all for the hints and suggestions.

Third and most importantly: I have a new-found hate for raid controllers which disable disk caches.

Enabled the disk caches, got the proper speeds.

Let’s see how big the real-world gains are though.