Primary Postgres database process (postmaster) eating all CPU

pfaffman · April 2, 2018, 10:12 PM

I’ve got a 2-container install on a DO 8GB droplet that is behaving very strangely.

There is a postmaster process (EDIT: now there are two of them) eating 100% CPU.
Sidekiq is running, but the Dashboard complains that it’s not checking for updates.

There are some logs like

  PG::ConnectionBad (FATAL: remaining connection slots are reserved for non-replication superuser connections ) /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/pg-0.21.0/lib/pg.rb:56:in `initialize'

and

Job exception: FATAL: remaining connection slots are reserved for non-replication superuser connections

The data container has:

  db_shared_buffers: "2GB"
  db_work_mem: "40MB"

There are 4 unicorn workers in the web container (same as # processors).

Plugins:

          - git clone https://github.com/discourse/docker_manager.git
          #- git clone https://github.com/SumatoSoft/discourse-adplugin.git
          #- git clone https://github.com/davidcelis/new_relic-discourse.git
          - git clone https://github.com/discourse/discourse-cakeday.git
          - git clone https://github.com/ekkans/lrqdo-editor-plugin-discourse.git
          #- git clone https://github.com/davidtaylorhq/discourse-whos-online.git
          - git clone https://github.com/pmusaraj/discourse-onesignal.git

Memory:

KiB Mem :  8174936 total,   169976 free,  1288084 used,  6716876 buff/cache
KiB Swap:  2097148 total,  2094304 free,     2844 used.  4369992 avail Mem

mpalmer · April 3, 2018, 3:05 AM

The postgresql connection limit needs to be increased. That will cause the database as a whole to use more memory, but based on the free output you’ve got plenty that could be used if required. I’d double the current value, and review errors and resource consumption.
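
For anyone wondering why the error shows up before the limit is nominally reached: PostgreSQL holds back a few slots for superusers, so ordinary clients hit the "remaining connection slots are reserved" error slightly below max_connections. A rough sketch of that arithmetic, assuming superuser_reserved_connections is still at its default of 3 and that the current limit is the stock value of 100 (the helper name is made up for illustration):

```python
# Sketch only: how many connection slots ordinary (non-superuser) clients can use.
# ASSUMPTION: superuser_reserved_connections is at the PostgreSQL default of 3.

def usable_slots(max_connections: int, superuser_reserved: int = 3) -> int:
    """Slots left for regular clients before the FATAL error appears."""
    if superuser_reserved >= max_connections:
        # PostgreSQL itself refuses to start with such a configuration.
        raise ValueError("reserved slots must be smaller than max_connections")
    return max_connections - superuser_reserved

current = usable_slots(100)  # assumed stock limit
doubled = usable_slots(200)  # after doubling it
print(current, doubled)
```

Doubling the limit therefore roughly doubles the headroom for the web and Sidekiq processes, at the cost of more memory on the database side.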

pfaffman · April 3, 2018, 3:36 AM

Uh. Where is that changed?

You mean this?

  db_work_mem: "80MB"

I did that, but I’m still getting a 502 error on the admin dashboard.

The other issue is that this site is using cloudflare with no caching (I’m told). I have included the cloudflare template, but I still suspect something is wrong with cloudflare.

mpalmer · April 3, 2018, 5:57 AM

It’s the max_connections parameter in postgresql.conf. I don’t see a tunable for that in discourse_docker, so I suspect you’ll need to play games with a pups exec stanza to make the edit.

As for Cloudflare, all the cloudflare template does is make it so that IP addresses get fixed after going through Cloudflare proxying. It doesn’t do anything to make Cloudflare cache. You might want to keep that in a separate topic, rather than mix them together in here.

pfaffman · April 3, 2018, 8:33 PM

Not one for playing games when they’re not necessary, I went into the data container, edited postgresql.conf by hand, doubled max_connections (from 100 to 200) and, LO! it seems that all is well.

I don’t understand just why I’ve not encountered this before or why this is the solution here. The database doesn’t seem that big and the traffic doesn’t seem that high.

Edit: I have played the games and won!

If anyone else cares. . . stick this in data.yml in hooks in the after_postgres section. I put it after the -exec section.

    # double max_connections to 200
    - replace:
        filename: "/etc/postgresql/9.5/main/postgresql.conf"
        from: /#?max_connections *=.*/
        to: "max_connections = 200"
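
To sanity-check what that pattern will match before rebuilding, here is a small Python sketch of the same substitution. The function name and the sample config excerpts are invented for illustration, and it assumes only the first match gets replaced:

```python
import re

# Python equivalent of the Ruby pattern /#?max_connections *=.*/ from the stanza.
PATTERN = re.compile(r"#?max_connections *=.*")

def set_max_connections(conf_text: str, value: int) -> str:
    """Replace the first max_connections setting, commented out or not."""
    # ASSUMPTION: only the first occurrence is rewritten, hence count=1.
    return PATTERN.sub(f"max_connections = {value}", conf_text, count=1)

# Hypothetical excerpts, not copied from a real server.
active = "port = 5432\nmax_connections = 100\t\t# (change requires restart)\n"
commented = "port = 5432\n#max_connections = 100\n"

print(set_max_connections(active, 200))
print(set_max_connections(commented, 200))
```

Both excerpts end up with a plain `max_connections = 200` line, and the trailing comment on the active line is swallowed by `.*`. One thing to watch: a line written as `# max_connections = 100` (with a space after the `#`) would keep its `# ` prefix and stay commented out.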

markersocial · September 28, 2019, 3:58 PM

Sorry for digging up an old thread.

@pfaffman, did this resolve your Postmaster’s Gone Wild high CPU usage issue?

I changed max connections directly in postgresql.conf (/var/discourse/shared/standalone/postgres_data/postgresql.conf) and ran ./launcher rebuild app, but I haven’t noticed any change.

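pfaffman · September 28, 2019, 4:32 PM

The problem seems to have gone away.

I tried giving Postgres more memory, and less. Adding swap seemed to help (which is why I then tried reducing the memory allocated to Postgres). What may have helped was backing up and restoring the database. Or maybe none of it made any difference.

There’s no magic solution, but those are the things I tried.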

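eboehnisch · May 27, 2020, 4:42 PM

I’ve been seeing the same behavior since updating to 2.5.0.beta5. postmaster processes keep piling up, each one taking as much CPU as it can get, and sometimes taking several minutes to finish. This slowly burns through the server’s AWS credits and slows down the whole forum, at times making it unusable.

Increasing max_connections did not help, and rebuilding the app made no difference either.

I never saw this before updating to 2.5.0.beta5. Any clues as to where I should look?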

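RobinTS · May 27, 2020, 10:47 PM

I updated our forum to 2.5.0.beta5 yesterday, and since then we’ve had lag and unresponsiveness. Right now several postmaster jobs are sitting at the top of the process list, using 90-100% of the CPU. As a result, many parts of the forum time out and return 502 errors to users.

These jobs come and go, but while they are active the forum is very hard to use.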

codinghorror · May 27, 2020, 10:55 PM

Could this be the finalization steps of the Postgres 12 upgrade? I believe some internal cleanup is needed after moving from PG10 to PG12. Does this persist for more than a day?

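RobinTS · May 28, 2020, 12:46 AM

It has been 13 hours so far.

And to confirm, we did move from PG 10 to 12 (I know staying on 10 is possible, but just to be clear).

I don’t know whether this is related, but visiting a user’s summary consistently spikes CPU to 90%+ and always ends in a 502 error. Other sections of the profile seem to work, but very slowly.

I’ll watch it through the day to see whether it resolves itself, and will update here once it does.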

codinghorror · May 28, 2020, 1:36 AM

There may be some housekeeping required after the migration. Take a close look at the first post of the official upgrade topic, which has details and recommended steps: PostgreSQL 12 update

markersocial · May 28, 2020, 6:00 AM

FYI, I ran into the same problem and resolved it with the following:

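RobinTS · May 28, 2020, 4:20 PM

@codinghorror, @markersocial, thanks for the pointers. It has been over a day and things seem to be back to normal now. I didn’t do anything, I just waited.

I’ll keep monitoring to see whether the 502 errors come back (it may just be the lower traffic during off-peak hours).

If it happens again, I’ll try the steps you listed.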
