主 Postgres 数据库进程 (postmaster) 占用全部 CPU

我在 DigitalOcean 8GB 的 Droplet 上部署了一个双容器环境,但运行表现非常异常。

有一个 postmaster 进程(编辑:现在有两个了)占用了 100% 的 CPU。
Sidekiq 正在运行,但仪表盘却报错称它未检查更新。

日志中出现如下信息:

  PG::ConnectionBad (FATAL: remaining connection slots are reserved for non-replication superuser connections ) /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/pg-0.21.0/lib/pg.rb:56:in `initialize'

以及

Job exception: FATAL: remaining connection slots are reserved for non-replication superuser connections	

数据容器的配置如下:

  db_shared_buffers: "2GB"
  db_work_mem: "40MB"

Web 容器中有 4 个 Unicorn 工作进程(与 CPU 核心数相同)。

已安装的插件:

          - git clone https://github.com/discourse/docker_manager.git
          #- git clone https://github.com/SumatoSoft/discourse-adplugin.git
          #- git clone https://github.com/davidcelis/new_relic-discourse.git
          - git clone https://github.com/discourse/discourse-cakeday.git
          - git clone https://github.com/ekkans/lrqdo-editor-plugin-discourse.git
          #- git clone https://github.com/davidtaylorhq/discourse-whos-online.git
          - git clone https://github.com/pmusaraj/discourse-onesignal.git

内存使用情况:

KiB Mem :  8174936 total,   169976 free,  1288084 used,  6716876 buff/cache
KiB Swap:  2097148 total,  2094304 free,     2844 used.  4369992 avail Mem 
1 个赞

The postgresql connection limit needs to be increased. That will cause the database as a whole to use more memory, but based on the free output you’ve got plenty that could be used if required. I’d double the current value, and review errors and resource consumption.

5 个赞

Uh. Where is that changed?

You mean this?

  db_work_mem: "80MB"

I did that, but I’m still getting a 502 error on the admin dashboard.

The other issue is that this site is using cloudflare with no caching (I’m told). I have included the cloudflare template, but I still suspect something is wrong with cloudflare.

1 个赞

It’s the max_connections parameter in postgresql.conf. I don’t see a tunable for that in discourse_docker, so I suspect you’ll need to play games with a pups exec stanza to make the edit.

As for Cloudflare, all the cloudflare template does it make it so that IP addresses get fixed after going through Cloudflare proxying. It doesn’t do anything to make Cloudflare cache. You might want to keep that in a separate topic, rather than mix them together in here.

3 个赞

Not one for playing games when they’re not necessary, I went into the data container, edited postgresql.conf by hand, doubled max_connections (from 100 to 200) and, LO! it seems that all is well.

I don’t understand just why I’ve not encountered this before or why this is the solution here. The database doesn’t seem that big and the traffic doesn’t seem that high.

Edit: I have played the games and won!

If anyone else cares. . . stick this in data.yml in hooks in the after_postgres section. I put it after the -exec section.

    # double max_connections to 200
    - replace:
        filename: "/etc/postgresql/9.5/main/postgresql.conf"
        from: /#?max_connections *=.*/
        to: "max_connections = 200"

10 个赞

Sorry to bump an old thread.

@pfaffman Did this solve the postmasters gone wild high CPU usage issue for you?

I modified max connections directly in postgresql.conf (/var/discourse/shared/standalone/postgres_data/postgresql.conf) and used ./launcher rebuild app. Haven’t noticed a difference though.

The problem seems to have gone away

I tried giving postgres more memory and less. Adding swap seemed to have helped (hence trying giving pg less memory) . One thing that I did that might have helped was to backup and restore the database. Or it could be that it did nothing.

I don’t have a silver bullet, but those are the things that I did. :confused:

3 个赞

安装 2.5.0.beta5 更新后,我也遇到了同样的问题。postmaster 进程一个接一个地出现,它们会尽可能占用大量 CPU,有时需要几分钟才能完成。这逐渐耗尽了服务器的所有 AWS 额度,导致整个论坛变得迟缓甚至无法使用。

增加 max_connections 没有任何效果,重新构建应用也是如此。

在更新到 2.5.0.beta5 之前,我从未遇到过这种情况。请问我应该从哪里着手排查?

2 个赞

我们昨天将论坛更新到了 2.5.0.beta5,此后论坛变得缓慢且无响应。目前有少数 Postmaster 任务位于顶部,占用了 90-100% 的 CPU 资源。这导致论坛的许多部分超时并向用户返回 502 错误。

这些任务时有时无,但在它们运行期间,论坛几乎无法正常使用。

这不正是 PostgreSQL 12 升级的最终化步骤吗?我认为在从 PG10 迁移到 PG12 后,它需要进行一些内部清理。这种情况会持续一天或更久吗?

到目前为止已经过去了 13 小时。

另外,确认一下:我确实是从 PG 10 升级到了 12(我知道可以选择不升级,继续停留在 10,只是想澄清这一点)。

我不确定这是否相关,但访问用户摘要 consistently 会导致 CPU 使用率飙升至 90% 以上,并且总是以 502 错误告终。个人资料的其他部分似乎可以正常工作,尽管速度较慢。

我会在今天持续关注情况,看看问题是否会自行解决,并在此更新进展。

迁移后可能需要进行一些清理工作。如果您仔细查看官方升级主题的第一篇帖子,其中包含详细信息和推荐步骤:PostgreSQL 12 update

2 个赞

提醒一下,我也遇到过同样的问题,通过以下步骤解决了:

3 个赞

感谢 @codinghorror@markersocial 提供的说明。已经过去一天多了,看起来一切已恢复正常。我什么都没做,只是等待。

我会继续关注情况,看看是否还会出现 502 错误(这可能是因为非高峰时段用户较少)。

如果再次发生,我会尝试您列出的步骤。

3 个赞