Primary Postgres database process (postmaster) eating all CPU

pfaffman · 02.April.2018 22:12:14

I’ve got a 2-container install on a DO 8GB droplet that is behaving very strangely.

There is a postmaster process (EDIT: now there are two of them) eating 100% CPU.
Sidekiq is running, but the Dashboard complains that it’s not checking for updates.

There are some logs like

  PG::ConnectionBad (FATAL: remaining connection slots are reserved for non-replication superuser connections ) /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/pg-0.21.0/lib/pg.rb:56:in `initialize'

and

Job exception: FATAL: remaining connection slots are reserved for non-replication superuser connections

The data container has:

  db_shared_buffers: "2GB"
  db_work_mem: "40MB"

There are 4 unicorn workers in the web container (same as # processors).

Plugins:

          - git clone https://github.com/discourse/docker_manager.git
          #- git clone https://github.com/SumatoSoft/discourse-adplugin.git
          #- git clone https://github.com/davidcelis/new_relic-discourse.git
          - git clone https://github.com/discourse/discourse-cakeday.git
          - git clone https://github.com/ekkans/lrqdo-editor-plugin-discourse.git
          #- git clone https://github.com/davidtaylorhq/discourse-whos-online.git
          - git clone https://github.com/pmusaraj/discourse-onesignal.git

Memory:

KiB Mem :  8174936 total,   169976 free,  1288084 used,  6716876 buff/cache
KiB Swap:  2097148 total,  2094304 free,     2844 used.  4369992 avail Mem

mpalmer · 03.April.2018 03:05:39

The postgresql connection limit needs to be increased. That will cause the database as a whole to use more memory, but based on the free output you’ve got plenty that could be used if required. I’d double the current value, and review errors and resource consumption.
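
As a rough way to reason about the memory cost of raising the limit, here is a back-of-the-envelope sketch. It assumes each connection uses at most one work_mem allocation at a time, which is only an approximation: a single query can use work_mem several times, and idle connections use far less. The function name is hypothetical; the 2GB and 40MB figures are the ones from the first post, and 100 is the max_connections value reported later in the thread.

```python
def rough_pg_memory_mb(shared_buffers_mb: int, work_mem_mb: int, max_connections: int) -> int:
    """Crude estimate: shared_buffers plus one work_mem allocation per connection.

    This is an assumption-laden approximation, not how PostgreSQL accounts for memory.
    """
    return shared_buffers_mb + work_mem_mb * max_connections


# db_shared_buffers: "2GB" and db_work_mem: "40MB" from the data container config.
current = rough_pg_memory_mb(2048, 40, 100)
doubled = rough_pg_memory_mb(2048, 40, 200)
print(current, doubled)
```

If every connection were busy at once, the doubled figure would exceed the 8GB on this droplet, which is why it is worth watching actual resource consumption after the change rather than trusting the estimate.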

pfaffman · 03.April.2018 03:36:43

Uh. Where is that changed?

You mean this?

  db_work_mem: "80MB"

I did that, but I’m still getting a 502 error on the admin dashboard.

The other issue is that this site is using cloudflare with no caching (I’m told). I have included the cloudflare template, but I still suspect something is wrong with cloudflare.

mpalmer · 03.April.2018 05:57:34

It’s the max_connections parameter in postgresql.conf. I don’t see a tunable for that in discourse_docker, so I suspect you’ll need to play games with a pups exec stanza to make the edit.

As for Cloudflare, all the cloudflare template does is make sure client IP addresses are reported correctly after passing through Cloudflare's proxy. It doesn't do anything to make Cloudflare cache. You might want to keep that in a separate topic, rather than mixing the two together in here.

pfaffman · 03.April.2018 20:33:06

Not one for playing games when they’re not necessary, I went into the data container, edited postgresql.conf by hand, doubled max_connections (from 100 to 200) and, LO! it seems that all is well.

I don’t understand just why I’ve not encountered this before or why this is the solution here. The database doesn’t seem that big and the traffic doesn’t seem that high.

Edit: I have played the games and won!

If anyone else cares. . . stick this in data.yml in hooks in the after_postgres section. I put it after the -exec section.

    # double max_connections to 200
    - replace:
        filename: "/etc/postgresql/9.5/main/postgresql.conf"
        from: /#?max_connections *=.*/
        to: "max_connections = 200"
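
For anyone wondering what that replace stanza does to the file, here is a minimal Python sketch of the same substitution. It assumes pups swaps only the first match of the pattern, and the sample config line is made up for illustration; the function name is hypothetical.

```python
import re

# Same pattern as the `from:` value in the stanza, written as a Python regex.
PATTERN = re.compile(r"#?max_connections *=.*")


def set_max_connections(conf_text: str, value: int) -> str:
    """Replace the first max_connections line, whether commented out or not."""
    return PATTERN.sub(f"max_connections = {value}", conf_text, count=1)


# Hypothetical excerpt of postgresql.conf, for illustration only.
sample = "max_connections = 100\t\t# (change requires restart)\nshared_buffers = 128MB\n"
print(set_max_connections(sample, 200))
```

Because the pattern ends in `.*`, anything after the value on that line, including a trailing comment, is dropped along with the old setting. PostgreSQL only picks up a new max_connections after a restart.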

markersocial · 28.September.2019 15:58:24

Sorry to bump an old topic.

@pfaffman, did this help you resolve the high CPU usage from the postmaster processes?

I changed max_connections directly in postgresql.conf (/var/discourse/shared/standalone/postgres_data/postgresql.conf) and ran ./launcher rebuild app. I haven't noticed any difference yet, though.

pfaffman · 28.September.2019 16:32:36

The problem seems to have gone away.

I tried giving PostgreSQL more memory, and less. Adding swap seemed to help (which is why I tried giving PostgreSQL less memory). One thing I did that may have helped was backing up and restoring the database. Or it may have had no effect at all.

I don't have a silver bullet, but that's what I did.

eboehnisch · 27.May.2020 16:42:25

I started seeing this too after updating to 2.5.0.beta5. More and more postmaster processes gradually appear and max out the CPU, and they sometimes take several minutes to finish. This slowly eats up all of the server's AWS credits and makes the whole forum slow or even unusable.

Increasing max_connections had no effect, and neither did rebuilding the app.

I never saw this before updating to 2.5.0.beta5. Any hints on where to look?

RobinTS · 27.May.2020 22:47:24

Yesterday we updated our forum to 2.5.0.beta5, and since then it has been slow and unresponsive. Several postmaster jobs immediately show up right at the top, using 90–100% CPU. As a result, many parts of the forum time out and return a 502 error to users.

The jobs come and go, but while they are active the forum is practically unusable.

codinghorror · 27.May.2020 22:55:55

Isn't this the final stages of the Postgres 12 update? I believe some internal cleanup is needed after migrating from PG10 to PG12. Does this go on for a day or more?

RobinTS · 28.May.2020 00:46:26

It's been 13 hours so far.

Also, to confirm: I did go from PG 10 to 12 (I know you can optionally stay on 10, I just want to be clear).

I'm not sure this is relevant, but going to a user's summary consistently makes CPU usage spike to 90% or more, and it always ends in a 502 error. Other parts of the profile seem to work, albeit slowly.

I'll keep an eye on it throughout the day to see whether it fixes itself, and I'll update here.

codinghorror · 28.May.2020 01:36:16

A cleanup may be needed after the migration. If you read the first post of the official update topic linked below closely, it lists the details and the recommended steps: PostgreSQL 12 update

markersocial · 28.May.2020 06:00:25

For what it's worth, I had the same problem, and it was resolved as follows:

RobinTS · 28.May.2020 16:20:15

Thanks @codinghorror and @markersocial for the instructions. It's been more than a day now, and everything seems to be back to normal. I didn't do anything except wait.

I'll keep monitoring to see whether any more 502 errors show up (the improvement could just be down to low user activity during off-peak hours).

If it happens again, I'll try the steps you listed.
