Runaway PostgreSQL I/O

Hi everyone,

I installed my server using the Docker-based Discourse installer, and I've run into a problem I could use some help with. Every day around noon, for no apparent reason, the server gets bogged down: I/O load goes up and the forum starts returning error 500.

Looking at iotop, it's consistently PostgreSQL UPDATE queries consuming all the I/O.

So today, when the problem occurred again, I grabbed a list of all running SQL queries with the following command:
sudo -u postgres psql discourse -o /tmp/RunningQueries -c 'SELECT datname,pid,state,query FROM pg_stat_activity'

The output of this command can be viewed on Pastebin.

As you can see, around 32 UPDATE queries are running at this time of day. Meanwhile, iotop reports database reads and writes ranging from 2.5 MB/s to 15 MB/s.
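For anyone wanting to reproduce this, a variant of the query above can also show how long each statement has been running, sorted with the longest-running first (a sketch, assuming the same `psql` access as the command earlier; the column list is illustrative):

```shell
# Like the pg_stat_activity query above, but include each active query's
# runtime so the stuck UPDATEs float to the top of the list.
sudo -u postgres psql discourse -c \
  "SELECT pid, state, now() - query_start AS runtime, left(query, 60) AS query
   FROM pg_stat_activity
   WHERE state = 'active'
   ORDER BY runtime DESC;"
```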

Matching the running UPDATE queries against the queries recorded in the log file (/var/discourse/shared/standalone/log/var-log/postgresql) confirms that these queries take a very long time:

2018-07-03 12:51:27.052 UTC [17504] discourse@discourse LOG:  duration: 2352061.872 ms  statement: UPDATE drafts
                       SET  data = '{"reply":"<redacted for debugging purposes>","action":"reply","categoryId":24,"postId":118034,"archetypeId":"regular","whisper":false,"metaData":null,"composerTime":65922,"typingTime":8400}',

This corresponds to the following query:

  discourse | 17504 | active | UPDATE drafts                                                          +
            |       |        |     SET  data = '{"reply":"<redacted for debugging purposes>","action":"reply","categoryId":24,"postId":118034,"archetypeId":"regular","whisper":false,"metaData":null,"composerTime":65922,"typingTime":8400}', +
            |       |        |     sequence = 124,                                                     +
            |       |        |     revisions = revisions + 1                                           +
            |       |        | WHERE id = 84548

Restarting the Docker app kills the queries, but basically all I can do is wait it out, which leaves users frustrated.

Is there anything I can do to improve this situation? For example, moving the maintenance tasks to around 5 a.m. or so.

If you have suggestions for digging deeper, please share them! Any help is much appreciated :slight_smile:

Ninja edit: I remembered some additional info. I couldn't tie this to any existing cronjob (either on the host OS or inside the Docker app).

Server background:

  • Server OS: Ubuntu 18.04 LTS
  • VM specs: 100 GB disk, 4 GB RAM, 4 cores
  • Disks: RAID10 across six 15K disks
  • Discourse version: v2.1.0.beta2 +107
  • Plugins: babble, whos-online, voting, cakeday, anonymous-categories, league
  • Page views per day: ~75,000
  • Users: 1,350 (~700 daily active users)
  • Posts: 73,600; topics: 351; most active topic: 13,500 posts

Thanks for reading this far! :slight_smile:

  • Are those spinning disks?

  • How big is your database?

I had an experience like this with the combo: active forum + spinning disks. In the end it was Redis and PostgreSQL fighting for IO and after a while redis would stop because it had multiple rdb saves in the queue to happen.

Can you please check if redis is saving to disk during the problem?
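To check whether Redis is writing an RDB snapshot at the moment the problem occurs, its INFO output has a persistence section (a sketch, assuming `redis-cli` is reachable from inside the container):

```shell
# rdb_bgsave_in_progress:1 means an RDB snapshot is being written to disk
# right now; rdb_last_bgsave_status shows whether the last save succeeded.
redis-cli info persistence | grep -E 'rdb_bgsave_in_progress|rdb_last_bgsave_status|aof_enabled'
```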

They are SAS disks, 900GB 15K (spinning disks, not SSDs) configured in RAID10. They are not exactly slow, but you obviously wouldn’t get the IOPS you’d get from an SSD.

As for the database: the postgres_data directory is 4.5GB.

I doubt it’s redis in this case though - I stupidly forgot to grab a full screenshot from iotop, but I can’t remember seeing redis in there. I did see postgres UPDATE at the top of the list with IO activity of 98% and up.

BTW: I’ll check tomorrow as I have no doubt it will happen again. And I’ll grab a screenshot from iotop.

I still don’t feel redis should be constantly clobbering the disk. It’s supposed to be an in memory key-value store most of all.

Hi, server’s very busy again - changed the timezone within the container from UTC to GMT+1 (Europe/Amsterdam) and indeed, load now starts 2 hours later.

Screenshot from iotop -o (which only lists the actual tasks doing I/O):

I’ve observed the behaviour and indeed, over the last 5 minutes, I did see Redis pop up twice, but only for a moment. Doesn’t come across as being the source of the problems to me.

I have a feeling there are some scheduled tasks which are being started, but I can’t really find the source. Checked cron jobs from both within the container and the underlying OS - no dice.

If you have any suggestions then those would be more than welcome! Also, if there’s more I can do regarding information gathering, please tell me! :slight_smile:

More likely Postgres then, if you have ruled Redis out.

I think so too - see the screenshot of the iotop output - it’s a PostgreSQL UPDATE command that generates most of the IO load.

Thing is: what is triggered here? And why? Where can I find how (and when!) the postgresql maintenance tasks are triggered?

Edit: Additional info: Screenshot from sidekiq:


Looks like your problem is this one: Reschedule sidekiq scheduled job

You have one still active topic (last reply 19m ago) with 17k posts. That plus slow disks kill performance. The fact that your data is larger than the RAM also makes this more complicated.

That’s the most popular topic indeed. Would it be better to break it up in pieces, close this one and open a new one?

Thanks for the pointer btw - I’ll have a look!

Yes. I set auto close topics post count to 1024 or 512 when I had an HDD instance.

Also, I completely forgot, but looks like I took a stab at this query 2 years ago:

Surprising that you are running into this with 17k posts in one topic, I view 20k as the “starting to be a problem” cutoff. But maybe we should set the bar lower cc @tgxworld

It could be related to my specific setup though - if it helps, I can document a bit more, run some additional tests etc. (bonnie++ or so).

We have some public results about disk bench here:

Maybe run those (when the forum has low load) to compare?
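A random-read test along those lines can be run with fio (the parameters here are illustrative, not necessarily the exact ones behind the public results):

```shell
# 10-second 4K random-read test against a 1 GiB test file, bypassing the
# page cache with --direct=1. Run during low load; it generates real I/O.
fio --name=randread --rw=randread --bs=4k --size=1G \
    --runtime=10 --time_based --direct=1 --filename=/tmp/fio.test
rm -f /tmp/fio.test
```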

Just as a thought exercise: what exactly would break if this specific job would not exist or run?

I can’t find any place in the code where Post.avg_time and Topic.avg_time are actually being used? Or is there maybe some generated accessor that I’m missing?

It’s used to compute the post score that we use to rank posts so we can summarize them.

I have mentioned this particular issue before… it is totally unrelated to @tgxworld’s work.

I am still not buying that we are getting any value from “geometric mean of all read times for all posts”… a simple… which posts are read most is probably good enough for the “best of” algorithm.

Plus this whole concept is a bit broken imo cause the further you get to the end of a topic the less people read it.

I think a simple fix here, is to stop doing the whole geometric mean read time thing on topics with more than say 1000 posts. Which is oddly the only time we would use it… :frowning:
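For what it’s worth, the geometric mean under discussion is just the exponential of the average log, which is cheap to compute in isolation; this is only an illustration of the formula, not Discourse’s actual implementation:

```shell
# Geometric mean of three read times (ms) via exp(mean(log x)):
# (100 * 200 * 400)^(1/3) = 200.
printf '100\n200\n400\n' | awk '{ s += log($1); n++ } END { printf "%.0f\n", exp(s / n) }'
```

The expense in the real job comes from scanning every read-time row for every post in a huge topic, not from the arithmetic itself.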

Silly idea perhaps: Topics have a certain flow, setup, experience to them. Some topics are by nature filled with short posts, others with long posts.

Would it be an idea to calculate the mean over the first 500 or 1000 posts and use that as an indication for all posts in the thread?

After all, it’s an indication so an approximation would be acceptable, no?

I don’t even know if this would help in our case. Fact is that post_timings is bigger than the internal memory of the server (20 GB vs 16 GB) and even limiting to the first x posts would still result in the database failing to pull the entire table into memory.

So my feeling is that attempting to do this query differently would work better than trying to “limit” it.
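That table size can be confirmed straight from PostgreSQL (a sketch; same `psql` access as earlier in the thread):

```shell
# Total on-disk size of post_timings, including its indexes and TOAST data.
sudo -u postgres psql discourse -c \
  "SELECT pg_size_pretty(pg_total_relation_size('post_timings'));"
```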

Hi, I did run a test just now on the server (still relatively active with 30 active users online). These are the results:

19.3 k requests completed in 9.89 s, 75.5 MiB read, 1.95 k iops, 7.63 MiB/s
generated 19.3 k requests in 10.0 s, 75.5 MiB, 1.93 k iops, 7.55 MiB/s

Everything seems to be about 1/7th of the speed you get. Perhaps indeed I should consider switching from old-school HDD to SSDs.

These numbers are terrifyingly bad even for traditional HDDs. Way below what fancy enterprise 10k and 15k drives should be doing.
