Primary Postgres database process (postmaster) eating all CPU

I’ve got a 2-container install on a DO 8GB droplet that is behaving very strangely.

There is a postmaster (EDIT: now there are two of them) process eating 100% CPU.
Sidekiq is running, but the Dashboard complains that it’s not checking for updates.

There are some logs like

  PG::ConnectionBad (FATAL: remaining connection slots are reserved for non-replication superuser connections ) /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/pg-0.21.0/lib/pg.rb:56:in `initialize'

and

Job exception: FATAL: remaining connection slots are reserved for non-replication superuser connections	

The data container has:

  db_shared_buffers: "2GB"
  db_work_mem: "40MB"

There are 4 unicorn workers in the web container (same as # processors).
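
For reference, that worker count comes from the env section of the web container’s yml (the value here just mirrors my setup):

    env:
      UNICORN_WORKERS: 4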

Plugins:

          - git clone https://github.com/discourse/docker_manager.git
          #- git clone https://github.com/SumatoSoft/discourse-adplugin.git
          #- git clone https://github.com/davidcelis/new_relic-discourse.git
          - git clone https://github.com/discourse/discourse-cakeday.git
          - git clone https://github.com/ekkans/lrqdo-editor-plugin-discourse.git
          #- git clone https://github.com/davidtaylorhq/discourse-whos-online.git
          - git clone https://github.com/pmusaraj/discourse-onesignal.git

Memory:

KiB Mem :  8174936 total,   169976 free,  1288084 used,  6716876 buff/cache
KiB Swap:  2097148 total,  2094304 free,     2844 used.  4369992 avail Mem 

1 Like

The PostgreSQL connection limit needs to be increased. That will cause the database as a whole to use more memory, but based on the free output you’ve got plenty that could be used if required. I’d double the current value, then review errors and resource consumption.
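
To see how close you are to the ceiling, a quick check from psql in the data container (plain Postgres, nothing Discourse-specific) is to compare the limit against the live connection count:

    SHOW max_connections;
    SELECT count(*) FROM pg_stat_activity;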

5 Likes

Uh. Where is that changed?

You mean this?

  db_work_mem: "80MB"

I did that, but I’m still getting a 502 error on the admin dashboard.

The other issue is that this site is using Cloudflare with no caching (I’m told). I have included the cloudflare template, but I still suspect something is wrong with Cloudflare.

1 Like

It’s the max_connections parameter in postgresql.conf. I don’t see a tunable for that in discourse_docker, so I suspect you’ll need to play games with a pups exec stanza to make the edit.
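
Untested, but the exec stanza I have in mind would look roughly like this (the path assumes the stock data image with Postgres 9.5):

    hooks:
      after_postgres:
        # guess: rewrite max_connections in place during bootstrap
        - exec: sed -i 's/^#\?max_connections *=.*/max_connections = 200/' /etc/postgresql/9.5/main/postgresql.conf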

As for Cloudflare, all the cloudflare template does is make it so that the real client IP addresses are restored after requests pass through Cloudflare’s proxying. It doesn’t do anything to make Cloudflare cache. You might want to keep that in a separate topic, rather than mix the two issues together in here.

3 Likes

Not one for playing games when they’re not necessary, I went into the data container, edited postgresql.conf by hand, doubled max_connections (from 100 to 200) and, LO! it seems that all is well.

I don’t understand just why I’ve not encountered this before or why this is the solution here. The database doesn’t seem that big and the traffic doesn’t seem that high.
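
For the record, the by-hand route was roughly the following (the container name is assumed to be data; note that max_connections only takes effect after Postgres restarts, which the container restart takes care of):

    cd /var/discourse
    ./launcher enter data
    # edit /etc/postgresql/9.5/main/postgresql.conf, set max_connections = 200
    exit
    ./launcher restart data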

Edit: I have played the games and won!

If anyone else cares… stick this in data.yml, under hooks, in the after_postgres section. I put it after the existing - exec entry.

    # double max_connections to 200
    - replace:
        filename: "/etc/postgresql/9.5/main/postgresql.conf"
        from: /#?max_connections *=.*/
        to: "max_connections = 200"

10 Likes

Sorry to bump an old thread.

@pfaffman Did this solve the postmasters gone wild high CPU usage issue for you?

I modified max_connections directly in postgresql.conf (/var/discourse/shared/standalone/postgres_data/postgresql.conf) and used ./launcher rebuild app. I haven’t noticed a difference, though.

The problem seems to have gone away

I tried giving Postgres more memory, and also less. Adding swap seemed to have helped (hence trying to give Postgres less memory). One thing I did that might have helped was to back up and restore the database. Or it could be that it did nothing.

I don’t have a silver bullet, but those are the things that I did. :confused:
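
For anyone who wants to try the backup-and-restore route, from memory it goes roughly like this from inside the app container (treat it as a sketch, and swap in your own backup filename):

    cd /var/discourse
    ./launcher enter app
    discourse enable_restore
    discourse backup
    discourse restore <your-backup-file>.tar.gz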

3 Likes

It has now started happening for me, too, after installing the update to 2.5.0.beta5. One by one, more postmaster processes appear, each using as much CPU as it can get, and it sometimes takes a few minutes before they complete. Slowly this eats up all the AWS credits for the server and makes the whole forum sluggish or even unusable.

Increasing max_connections didn’t have any effect, and neither did rebuilding the app.

Before I updated to 2.5.0.beta5 I had never seen this. Any hint where I should look?

2 Likes

We updated our forum to 2.5.0.beta5 yesterday, and since then it has been slow and unresponsive. There are a few postmaster jobs right at the top of top that eat 90-100% of the CPU. It’s causing many parts of the forum to time out and return a 502 to users.

The jobs come and go, but while they are active the forum isn’t very usable.

Wouldn’t this be the finalization steps of the Postgres 12 upgrade? I think there’s some internal cleanup it needs to do after migrating from PG10 to PG12? Does the situation persist for a day or more?

It has been 13 hours thus far.

Also, to confirm: I did go from PG 10 to 12 (I know you can optionally stay on 10, so just want to clarify).

I’m not sure if this is relevant, but going to a user’s summary consistently spikes CPU usage to 90%+ and it always ends in a 502. The other sections of the profile seem to work, albeit slowly.

I’ll keep an eye on things over the course of the day to see if things correct themselves and I will update here.

It could be that some cleanup is needed post-migration. If you check the official upgrade topic here and read the first post closely, there are details and recommended steps: PostgreSQL 12 update
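
If it does turn out to be the post-pg_upgrade statistics rebuild, the usual manual nudge is something along these lines from inside the container (my assumption; the linked topic has the authoritative steps):

    su postgres -c 'vacuumdb --all --analyze-in-stages'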

2 Likes

Just a heads up, I had the same issue and it was fixed by doing this:

3 Likes

Thank you @codinghorror and @markersocial for the instructions. It’s now been over a day and it looks like things are back to normal. I haven’t done anything but wait.

I’ll keep an eye on things and see if any more 502’s pop up (the calm might just be down to low traffic during off-peak hours).

If it occurs again I will try the steps that you’ve listed.

3 Likes