主 Postgres 数据库进程 (postmaster) 占用全部 CPU

pfaffman · 2018 年4 月 2 日 22:12

我在 DigitalOcean 8GB 的 Droplet 上部署了一个双容器环境，但运行表现非常异常。

有一个 postmaster 进程（编辑：现在有两个了）占用了 100% 的 CPU。
Sidekiq 正在运行，但仪表盘却报错称它未检查更新。

日志中出现如下信息：

  PG::ConnectionBad (FATAL: remaining connection slots are reserved for non-replication superuser connections ) /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/pg-0.21.0/lib/pg.rb:56:in `initialize'

以及

Job exception: FATAL: remaining connection slots are reserved for non-replication superuser connections

数据容器的配置如下：

  db_shared_buffers: "2GB"
  db_work_mem: "40MB"

Web 容器中有 4 个 Unicorn 工作进程（与 CPU 核心数相同）。

已安装的插件：

          - git clone https://github.com/discourse/docker_manager.git
          #- git clone https://github.com/SumatoSoft/discourse-adplugin.git
          #- git clone https://github.com/davidcelis/new_relic-discourse.git
          - git clone https://github.com/discourse/discourse-cakeday.git
          - git clone https://github.com/ekkans/lrqdo-editor-plugin-discourse.git
          #- git clone https://github.com/davidtaylorhq/discourse-whos-online.git
          - git clone https://github.com/pmusaraj/discourse-onesignal.git

内存使用情况：

KiB Mem :  8174936 total,   169976 free,  1288084 used,  6716876 buff/cache
KiB Swap:  2097148 total,  2094304 free,     2844 used.  4369992 avail Mem

mpalmer · 2018 年4 月 3 日 03:05

The postgresql connection limit needs to be increased. That will cause the database as a whole to use more memory, but based on the free output you’ve got plenty that could be used if required. I’d double the current value, and review errors and resource consumption.

pfaffman · 2018 年4 月 3 日 03:36

Uh. Where is that changed?

You mean this?

  db_work_mem: "80MB"

I did that, but I’m still getting a 502 error on the admin dashboard.

The other issue is that this site is using cloudflare with no caching (I’m told). I have included the cloudflare template, but I still suspect something is wrong with cloudflare.

mpalmer · 2018 年4 月 3 日 05:57

It’s the max_connections parameter in postgresql.conf. I don’t see a tunable for that in discourse_docker, so I suspect you’ll need to play games with a pups exec stanza to make the edit.

As for Cloudflare, all the cloudflare template does it make it so that IP addresses get fixed after going through Cloudflare proxying. It doesn’t do anything to make Cloudflare cache. You might want to keep that in a separate topic, rather than mix them together in here.

pfaffman · 2018 年4 月 3 日 20:33

Not one for playing games when they’re not necessary, I went into the data container, edited postgresql.conf by hand, doubled max_connections (from 100 to 200) and, LO! it seems that all is well.

I don’t understand just why I’ve not encountered this before or why this is the solution here. The database doesn’t seem that big and the traffic doesn’t seem that high.

Edit: I have played the games and won!

If anyone else cares. . . stick this in data.yml in hooks in the after_postgres section. I put it after the -exec section.

    # double max_connections to 200
    - replace:
        filename: "/etc/postgresql/9.5/main/postgresql.conf"
        from: /#?max_connections *=.*/
        to: "max_connections = 200"

markersocial · 2019 年9 月 28 日 15:58

抱歉打扰了旧帖。

@pfaffman 这是否解决了你遇到的 Postmaster 进程异常导致的高 CPU 占用问题？

我直接在 postgresql.conf 文件（/var/discourse/shared/standalone/postgres_data/postgresql.conf）中修改了最大连接数，并执行了 ./launcher rebuild app。不过目前尚未注意到任何变化。

pfaffman · 2019 年9 月 28 日 16:32

问题似乎已经解决了。我尝试给 PostgreSQL 分配更多或更少的内存。添加交换空间似乎有所帮助（因此我尝试减少给 PostgreSQL 的内存）。我做的其中一件事可能是备份并恢复数据库，这或许起了作用，也可能完全没用。我没有一劳永逸的解决方案，但这就是我尝试过的这些方法。

eboehnisch · 2020 年5 月 27 日 16:42

安装 2.5.0.beta5 更新后，我也遇到了同样的问题。postmaster 进程一个接一个地出现，它们会尽可能占用大量 CPU，有时需要几分钟才能完成。这逐渐耗尽了服务器的所有 AWS 额度，导致整个论坛变得迟缓甚至无法使用。

增加 max_connections 没有任何效果，重新构建应用也是如此。

在更新到 2.5.0.beta5 之前，我从未遇到过这种情况。请问我应该从哪里着手排查？

RobinTS · 2020 年5 月 27 日 22:47

我们昨天将论坛更新到了 2.5.0.beta5，此后论坛变得缓慢且无响应。目前有少数 Postmaster 任务位于顶部，占用了 90-100% 的 CPU 资源。这导致论坛的许多部分超时并向用户返回 502 错误。

这些任务时有时无，但在它们运行期间，论坛几乎无法正常使用。

codinghorror · 2020 年5 月 27 日 22:55

这不正是 PostgreSQL 12 升级的最终化步骤吗？我认为在从 PG10 迁移到 PG12 后，它需要进行一些内部清理。这种情况会持续一天或更久吗？

RobinTS · 2020 年5 月 28 日 00:46

到目前为止已经过去了 13 小时。

另外，确认一下：我确实是从 PG 10 升级到了 12（我知道可以选择不升级，继续停留在 10，只是想澄清这一点）。

我不确定这是否相关，但访问用户摘要 consistently 会导致 CPU 使用率飙升至 90% 以上，并且总是以 502 错误告终。个人资料的其他部分似乎可以正常工作，尽管速度较慢。

我会在今天持续关注情况，看看问题是否会自行解决，并在此更新进展。

codinghorror · 2020 年5 月 28 日 01:36

迁移后可能需要进行一些清理工作。如果您仔细查看官方升级主题的第一篇帖子，其中包含详细信息和推荐步骤：PostgreSQL 12 update

markersocial · 2020 年5 月 28 日 06:00

提醒一下，我也遇到过同样的问题，通过以下步骤解决了：

RobinTS · 2020 年5 月 28 日 16:20

感谢 @codinghorror 和 @markersocial 提供的说明。已经过去一天多了，看起来一切已恢复正常。我什么都没做，只是等待。

我会继续关注情况，看看是否还会出现 502 错误（这可能是因为非高峰时段用户较少）。

如果再次发生，我会尝试您列出的步骤。

话题		回复	浏览量
Slow Sidekiq + Postmaster using 95%+ CPU (32 cores) after Postgresql Version Upgrade Self-hosting server-resources	22	3261	2020 年5 月 29 日
Discourse Crash due to PSQL connection issue Self-hosting	9	491	2024 年3 月 17 日
Unusually high CPU usage Self-hosting	31	983	2026 年2 月 18 日
Too many connections to DB, how to optimize Support	17	3944	2017 年7 月 27 日
Discourse Bad Gateway after reboot Self-hosting	15	2238	2020 年7 月 2 日

主 Postgres 数据库进程 (postmaster) 占用全部 CPU

相关话题