Runaway PostgreSQL IO

JanJoost · July 3, 2018, 21:59

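Hi all,

I've run into a problem and would like to ask for your help. I deployed a Discourse server using the Docker-based installer. Every day around noon, the server becomes busy for some unknown reason, IO load spikes, and the forum starts returning 500 errors.

iotop always shows PostgreSQL UPDATE operations taking up all of the IO.

So today, when the problem occurred again, I used the following command to get a list of all running SQL queries:
sudo -u postgres psql discourse -o /tmp/RunningQueries -c 'SELECT datname,pid,state,query FROM pg_stat_activity'

The output of that command can be viewed here: Pastebin link.

As you can see, about 32 UPDATE queries are running during these periods. When this happens, iotop shows the database reading and writing at between 2.5MB/s and 15MB/s.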
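A minimal sketch of how the saved pg_stat_activity output could be tallied by state and statement type. It assumes psql's default aligned format, with `|` as the column separator and a trailing `+` on wrapped query lines. The function name and the sample rows are made up for illustration; only the column order (datname, pid, state, query) comes from the command above.

```python
from collections import Counter


def count_statements(psql_output: str) -> Counter:
    """Count (state, first SQL keyword) pairs in aligned psql output.

    Assumes columns datname | pid | state | query, as selected above.
    Continuation lines of a multi-line query have an empty pid column
    and are skipped, so each backend is counted once.
    """
    counts = Counter()
    for line in psql_output.splitlines():
        parts = [p.strip() for p in line.split("|", 3)]
        if len(parts) != 4 or not parts[1].isdigit():
            continue  # header, separator, continuation or footer line
        state, query = parts[2], parts[3].rstrip("+").strip()
        keyword = query.split()[0].upper() if query else "(none)"
        counts[(state, keyword)] += 1
    return counts


# Hypothetical sample in the same shape as /tmp/RunningQueries.
sample = """\
  datname  |  pid  | state  | query
-----------+-------+--------+------------------------
 discourse | 17504 | active | UPDATE drafts         +
           |       |        |    SET data = '...'   +
           |       |        |  WHERE id = 84548
 discourse | 17511 | active | UPDATE drafts         +
           |       |        |  WHERE id = 84549
 discourse | 17600 | idle   | SELECT 1
(3 rows)
"""

for (state, keyword), n in sorted(count_statements(sample).items()):
    print(state, keyword, n)
```

On the real file, `count_statements(open("/tmp/RunningQueries").read())` would give the same kind of breakdown, which makes it easier to compare one day's incident with the next.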

If I correlate the running UPDATE queries with the queries recorded in the logs (located in /var/discourse/shared/standalone/log/var-log/postgresql), I do find that these queries take a very, very long time:

2018-07-03 12:51:27.052 UTC [17504] discourse@discourse LOG:  duration: 2352061.872 ms  statement: UPDATE drafts
                       SET  data = '{"reply":"<redacted for debugging purposes>","action":"reply","categoryId":24,"postId":118034,"archetypeId":"regular","whisper":false,"metaData":null,"composerTime":65922,"typingTime":8400}',

This corresponds to:

 discourse | 17504 | active | UPDATE drafts +
           |       |        |                SET  data = '{"reply":"<redacted for debugging purposes>","action":"reply","categoryId":24,"postId":118034,"archetypeId":"regular","whisper":false,"metaData":null,"composerTime":65922,"typingTime":8400}', +
           |       |        |                     sequence = 124, +
           |       |        |                     revisions = revisions + 1 +
           |       |        |                WHERE id = 84548
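To put the logged duration in perspective, here is a small sketch that pulls the backend pid and the duration out of a log line shaped like the one quoted above and converts milliseconds to minutes. The regular expression and the function name are assumptions for illustration, written against that single sample line rather than against every possible log_line_prefix.

```python
import re

# Matches the start of a PostgreSQL log line like the one quoted above:
# timestamp, time zone, [pid], then "duration: <ms> ms" somewhere after it.
LOG_RE = re.compile(
    r"^(?P<ts>\S+ \S+) \S+ \[(?P<pid>\d+)\] .*?duration: (?P<ms>[\d.]+) ms"
)


def parse_duration(line: str) -> tuple:
    """Return (backend pid, duration in minutes) for a duration log line."""
    m = LOG_RE.match(line)
    if m is None:
        raise ValueError("not a duration log line")
    return int(m.group("pid")), float(m.group("ms")) / 1000 / 60


line = ("2018-07-03 12:51:27.052 UTC [17504] discourse@discourse LOG:  "
        "duration: 2352061.872 ms  statement: UPDATE drafts")
pid, minutes = parse_duration(line)
print(pid, round(minutes, 1))  # 17504 39.2
```

So the UPDATE on a single drafts row, served by the same backend (pid 17504) that shows up in pg_stat_activity, ran for roughly 39 minutes.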

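If I restart the Docker app, the queries do end, so basically all I can do is wait it out, which makes my users unhappy.

Please let me know if there is any way to mitigate this, for example by moving the maintenance tasks to around 5 AM.

If you have any suggestions for digging deeper, please share! Any help is much appreciated.

Ninja edit: just remembered some extra information: I can't correlate this with any existing cron job (either from the host OS or from inside the Docker app).

Background on the server:

Server OS: Ubuntu 18.04 LTS
VM configuration: 100GB disk, 4GB RAM, 4 cores
Disks: RAID10 of 6 15K RPM drives
Discourse version: v2.1.0.beta2 +107
Plugins: babble, whos-online, voting, cakeday, anonymous-categories, league
Daily page views: about 75,000
Users: 1350, with about 700 daily active users
Total posts: 73,600; topics: 351; the most active topic contains 13,500 posts

Thanks for reading this far!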

Falco · July 3, 2018, 22:06

Are those spinning disks?
How big is your database?

I had an experience like this with the combo: active forum + spinning disks. In the end it was Redis and PostgreSQL fighting for IO and after a while redis would stop because it had multiple rdb saves in the queue to happen.

Can you please check if redis is saving to disk during the problem?

JanJoost · July 3, 2018, 22:13

They are SAS disks, 900GB 15K (spinning disks, not SSDs) configured in RAID10. They are not exactly slow, but you obviously wouldn’t get the IOPS you’d get from an SSD.

As for the database: the postgres_data directory is 4.5GB.

I doubt it’s redis in this case though - I stupidly enough forgot to grab a full screenshot from iotop but I can’t remember seeing redis in there. Did see postgres UPDATE at the top of the list with IO activity of 98% and up.

BTW: I’ll check tomorrow as I have no doubt it will happen again. And I’ll grab a screenshot from iotop.

codinghorror · July 3, 2018, 22:24

I still don’t feel redis should be constantly clobbering the disk. It’s supposed to be an in memory key-value store most of all.

JanJoost · July 4, 2018, 12:17

Hi, server’s very busy again - changed the timezone within the container from UTC to GMT+1 (Europe/Amsterdam) and indeed, load now starts 2 hours later.

Screenshot from iotop -o (which only lists the actual tasks doing I/O):

I’ve observed the behaviour and indeed, over the last 5 minutes, I did see Redis pop up twice, but only for a moment. Doesn’t come across as being the source of the problems to me.

I have a feeling there are some scheduled tasks which are being started, but I can’t really find the source. Checked cron jobs from both within the container and the underlying OS - no dice.

If you have any suggestions then those would be more than welcome! Also, if there’s more I can do regarding information gathering, please tell me!

codinghorror · July 4, 2018, 12:35

More likely Postgres then, if you have ruled Redis out.

JanJoost · July 4, 2018, 12:42

I think so too - see the screenshot of the output of iostat - it’s a PostgreSQL UPDATE command that generates most of the IO load.

Thing is: what is triggered here? And why? Where can I find how (and when!) the postgresql maintenance tasks are triggered?

Edit: Additional info: Screenshot from sidekiq:

Falco · July 4, 2018, 13:13

Looks like your problem is this one: Reschedule sidekiq scheduled job

You have one still active topic (last reply 19m ago) with 17k posts. That plus slow disks kill performance. The fact that your data is larger than the RAM also makes this more complicated.

JanJoost · July 4, 2018, 13:16

That’s the most popular topic indeed. Would it be better to break it up into pieces, close this one and open a new one?

Thanks for the pointer btw - I’ll have a look!

Falco · July 4, 2018, 13:21

Yes. I set auto close topics post count to 1024 or 512 when I had an HDD instance.
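As a rough illustration of what those limits mean for a topic of this size, the sketch below computes how many topics the same number of posts would be spread over. The helper name is hypothetical, and 17,000 is only the approximate post count mentioned earlier in the thread.

```python
import math


def topics_needed(total_posts: int, auto_close_at: int) -> int:
    """Number of topics needed if each one closes at auto_close_at posts."""
    return math.ceil(total_posts / auto_close_at)


for limit in (1024, 512):
    print(limit, topics_needed(17_000, limit))
# 1024 17
# 512 34
```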

Also, I completely forgot, but it looks like I took a stab at this query 2 years ago:

codinghorror · July 4, 2018, 13:27

Surprising that you are running into this with 17k posts in one topic, I view 20k as the “starting to be a problem” cutoff. But maybe we should set the bar lower cc @tgxworld

JanJoost · July 4, 2018, 13:34

It could be related to my specific setup though - if it helps I can document a bit more, run some additional tests etc. (bonnie++ or so).

Falco · July 4, 2018, 15:15

We have some public results about disk bench here:

Maybe run those (when the forum has low load) to compare?

RGJ · July 4, 2018, 16:05

Just as a thought exercise: what exactly would break if this specific job did not exist or did not run?

I can’t find any place in the code where Post.avg_time and Topic.avg_time are actually being used? Or is there maybe some generated accessor that I’m missing?

zogstrip · July 4, 2018, 16:42

It’s used to compute the post score that we use to rank posts so we can summarize them.

sam · July 5, 2018, 00:58

I have mentioned this particular issue before… it is totally unrelated to @tgxworld’s work.

I am still not buying that we are getting any value from “geometric mean of all read times for all posts”… a simple… which posts are read most is probably good enough for the “best of” algorithm.

Plus this whole concept is a bit broken imo, because the closer you get to the end of a topic, the fewer people read it.

I think a simple fix here is to stop doing the whole geometric mean read time thing on topics with more than, say, 1000 posts. Which is oddly the only time we would use it…
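A minimal sketch of that idea in plain Python, not Discourse's actual implementation (which runs as a database query). The function names, the millisecond unit and the `max_posts` parameter are assumptions for illustration; the cutoff of 1000 is the number floated above.

```python
import math


def geometric_mean(values):
    """Geometric mean via the mean of logs; ignores non-positive values."""
    logs = [math.log(v) for v in values if v > 0]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0


def topic_avg_time(read_times_ms, post_count, max_posts=1000):
    """Return the geometric mean read time, or None for large topics.

    max_posts is the hypothetical cutoff from the suggestion above.
    """
    if post_count > max_posts:
        return None  # skip the expensive computation entirely
    return geometric_mean(read_times_ms)


print(round(topic_avg_time([1000, 4000, 16000], post_count=3)))  # 4000
print(topic_avg_time([1000, 4000], post_count=17_000))           # None
```

The point of the guard is that the cost is avoided before any timings are read, so a megatopic never triggers the scan in the first place.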

JanJoost · July 5, 2018, 10:13

Silly idea perhaps: Topics have a certain flow, setup, experience to them. Some topics are by nature filled with short posts, others with long posts.

Would it be an idea to calculate the mean over the first 500 or 1000 posts and use that as an indication for all posts in the thread?

After all, it’s an indication so an approximation would be acceptable, no?
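To make the idea concrete, here is a small sketch that estimates a topic-wide read time from the timings of its first posts only. The data layout (a dict from post number to a list of read times in milliseconds) and the function name are hypothetical; it only illustrates the sampling, not how Discourse stores or queries post timings.

```python
from statistics import geometric_mean


def estimated_avg_time(read_times_by_post, sample_posts=500):
    """Estimate a topic-wide read time from its first sample_posts posts.

    read_times_by_post maps post number -> list of read times (ms).
    Only timings for post numbers <= sample_posts are considered.
    """
    sample = [
        t
        for post_number, times in read_times_by_post.items()
        if post_number <= sample_posts
        for t in times
        if t > 0
    ]
    return geometric_mean(sample) if sample else None


# Post 900 falls outside the sample and does not affect the estimate.
timings = {1: [2000, 8000], 2: [4000], 900: [600_000]}
print(round(estimated_avg_time(timings, sample_posts=500)))  # 4000
```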

RGJ · July 5, 2018, 15:46

I don’t even know if this would help in our case. Fact is that post_timings is bigger than the internal memory of the server (20 GB vs 16 GB) and even limiting to the first x posts would still result in the database failing to pull the entire table into memory.

So my feeling is that attempting to do this query differently would work better than trying to “limit” it.

JanJoost · July 5, 2018, 22:56

Hi, I did run a test just now on the server (still relatively active with 30 active users online). These are the results:

19.3 k requests completed in 9.89 s, 75.5 MiB read, 1.95 k iops, 7.63 MiB/s
generated 19.3 k requests in 10.0 s, 75.5 MiB, 1.93 k iops, 7.55 MiB/s

Everything seems to be about 1/7th of the speed you get. Perhaps indeed I should consider switching from old-school HDD to SSDs.
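As a sanity check on the first result line, the sketch below recomputes IOPS and throughput from the raw totals and also derives the mean request size. The helper name is made up, and because the inputs are the rounded figures printed by the benchmark, the results are approximate.

```python
def summarize(requests: float, seconds: float, mib: float) -> dict:
    """Derive IOPS, throughput and mean request size from a benchmark line."""
    return {
        "iops": requests / seconds,
        "mib_per_s": mib / seconds,
        "kib_per_request": mib * 1024 / requests,
    }


stats = summarize(requests=19_300, seconds=9.89, mib=75.5)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

This reproduces the reported 1.95 k iops and 7.63 MiB/s, and works out to about 4 KiB per request, so the throughput figure looks low mainly because each request is small.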

codinghorror · July 6, 2018, 01:29

These numbers are terrifyingly bad even for traditional HDDs. Way below what fancy enterprise 10k and 15k drives should be doing.
