Primary Postgres database process (postmaster) eating all CPU

pfaffman · April 2, 2018, 10:12pm

I’ve got a 2-container install on a DO 8GB droplet that is behaving very strangely.

There is a postmaster process (EDIT: now there are two of them) eating 100% CPU.
Sidekiq is running, but the Dashboard complains that it’s not checking for updates.

There are some logs like

  PG::ConnectionBad (FATAL: remaining connection slots are reserved for non-replication superuser connections ) /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/pg-0.21.0/lib/pg.rb:56:in `initialize'

and

Job exception: FATAL: remaining connection slots are reserved for non-replication superuser connections

The data container has:

  db_shared_buffers: "2GB"
  db_work_mem: "40MB"

There are 4 unicorn workers in the web container (same as # processors).

Plugins:

          - git clone https://github.com/discourse/docker_manager.git
          #- git clone https://github.com/SumatoSoft/discourse-adplugin.git
          #- git clone https://github.com/davidcelis/new_relic-discourse.git
          - git clone https://github.com/discourse/discourse-cakeday.git
          - git clone https://github.com/ekkans/lrqdo-editor-plugin-discourse.git
          #- git clone https://github.com/davidtaylorhq/discourse-whos-online.git
          - git clone https://github.com/pmusaraj/discourse-onesignal.git

Memory:

KiB Mem :  8174936 total,   169976 free,  1288084 used,  6716876 buff/cache
KiB Swap:  2097148 total,  2094304 free,     2844 used.  4369992 avail Mem

mpalmer · April 3, 2018, 3:05am

The postgresql connection limit needs to be increased. That will cause the database as a whole to use more memory, but based on the free output you’ve got plenty that could be used if required. I’d double the current value, and review errors and resource consumption.
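
To make the memory trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It uses the db_shared_buffers and db_work_mem values quoted above and assumes, pessimistically, that every allowed connection runs exactly one work_mem-sized sort or hash at the same moment. Real usage is normally far lower, and a single complex query can use several multiples of work_mem, so treat this only as an illustration. The function name and the limits of 100 and 200 are assumptions made for the example.

```python
def rough_peak_mb(shared_buffers_mb: int, work_mem_mb: int, max_connections: int) -> int:
    """Very rough ceiling: shared buffers plus one work_mem allocation per connection."""
    return shared_buffers_mb + work_mem_mb * max_connections


SHARED_BUFFERS_MB = 2 * 1024  # db_shared_buffers: "2GB"
WORK_MEM_MB = 40              # db_work_mem: "40MB"

# Compare an assumed limit of 100 connections with a doubled limit of 200.
for limit in (100, 200):
    estimate = rough_peak_mb(SHARED_BUFFERS_MB, WORK_MEM_MB, limit)
    print(f"max_connections={limit}: roughly {estimate} MB in the worst case")
```

Under these assumptions the doubled limit could in theory exceed the droplet's 8 GB, which is one reason to review resource consumption after making the change.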

pfaffman · April 3, 2018, 3:36am

Uh. Where is that changed?

You mean this?

  db_work_mem: "80MB"

I did that, but I’m still getting a 502 error on the admin dashboard.

The other issue is that this site is using Cloudflare with no caching (I’m told). I have included the Cloudflare template, but I still suspect something is wrong with Cloudflare.

mpalmer · April 3, 2018, 5:57am

It’s the max_connections parameter in postgresql.conf. I don’t see a tunable for that in discourse_docker, so I suspect you’ll need to play games with a pups exec stanza to make the edit.

As for Cloudflare, all the Cloudflare template does is make it so that IP addresses get fixed after going through Cloudflare proxying. It doesn’t do anything to make Cloudflare cache. You might want to keep that in a separate topic, rather than mix them together in here.

pfaffman · April 3, 2018, 8:33pm

Not one for playing games when they’re not necessary, I went into the data container, edited postgresql.conf by hand, doubled max_connections (from 100 to 200) and, LO! it seems that all is well.

I don’t understand just why I’ve not encountered this before or why this is the solution here. The database doesn’t seem that big and the traffic doesn’t seem that high.

Edit: I have played the games and won!

If anyone else cares... stick this in data.yml, under hooks, in the after_postgres section. I put it after the existing - exec entry.

    # double max_connections to 200
    - replace:
        filename: "/etc/postgresql/9.5/main/postgresql.conf"
        from: /#?max_connections *=.*/
        to: "max_connections = 200"
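
To see what that pattern matches, here is a small Python sketch that applies the same regular expression to a made-up excerpt of postgresql.conf. It only mimics the substitution; pups itself is written in Ruby, and the sample lines are assumptions rather than a copy of a real install.

```python
import re

# Hypothetical excerpt of postgresql.conf; a real file has many more settings.
sample_conf = """\
port = 5432
max_connections = 100\t\t\t# (change requires restart)
shared_buffers = 2GB
"""

# Same pattern as the hook's `from:` value. The optional '#' also catches a
# commented-out setting, and '.*' swallows the rest of that line only.
pattern = re.compile(r"#?max_connections *=.*")

patched = pattern.sub("max_connections = 200", sample_conf)
print(patched)
```

The trailing comment on the matched line is replaced along with the value, and PostgreSQL only picks up a new max_connections after the server is restarted.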

markersocial · September 28, 2019, 3:58pm

Sorry for reviving an old thread.

@pfaffman, did this fix the high postmaster CPU usage for you?

I changed the maximum number of connections directly in postgresql.conf (/var/discourse/shared/standalone/postgres_data/postgresql.conf) and ran ./launcher rebuild app. I haven’t noticed any difference, though.

pfaffman · September 28, 2019, 4:32pm

The problem seems to have gone away.

I tried giving PostgreSQL more memory, and then less. Adding swap seems to have helped (hence the attempt to give pg less memory). One thing I did that might have helped was backing up and restoring the database. Or it might have had no effect at all.

I don’t have a magic solution, but those are the things I did.

eboehnisch · May 27, 2020, 4:42pm

This started happening to me as well after installing the update to 2.5.0.beta5. One after another, I see postmaster processes pile up, each using as much CPU as it can get and sometimes taking several minutes to finish. This slowly eats up all of the server’s AWS credits, making the whole forum slow or even unusable.

Increasing max_connections had no effect, and neither did rebuilding the app.

Before updating to 2.5.0.beta5 I had never seen this problem. Any clues as to where I should look?

RobinTS · May 27, 2020, 10:47pm

We updated our forum to 2.5.0.beta5 yesterday, and since then it has been slow and unresponsive. There are a few postmaster jobs at the top of the queue consuming 90-100% CPU. This is causing many parts of the forum to time out and return a 502 error to users.

The jobs come and go, but while they are active the forum is not very usable.

codinghorror · May 27, 2020, 10:55pm

Wouldn’t these be the finalization steps of the Postgres 12 upgrade? I believe there is some internal cleanup that has to run after migrating from PG10 to PG12. Does it persist for a day or more?

RobinTS · May 28, 2020, 12:46am

It has already been 13 hours.

Also, to confirm: I went from PG 10 to 12 (I know it’s possible to stay on 10 by choice, but I just wanted to clarify).

I’m not sure whether it’s relevant, but opening a user’s summary consistently makes CPU usage spike above 90% and always ends in a 502 error. The other profile sections seem to work, albeit slowly.

I’ll keep monitoring over the course of the day to see whether things sort themselves out, and I’ll update here.

codinghorror · May 28, 2020, 1:36am

Some cleanup may be needed after the migration. If you check the official update topic and read the first post carefully, you’ll find details and recommended steps: PostgreSQL 12 update

markersocial · May 28, 2020, 6:00am

Just a heads-up: I had the same problem, and it was fixed by doing this:

RobinTS · May 28, 2020, 4:20pm

Thanks to @codinghorror and @markersocial for the instructions. It has been more than 24 hours and everything seems to be back to normal. I did nothing other than wait.

I’ll keep monitoring to see whether any more 502 errors reappear (the calm could simply be down to the low number of users during off-peak hours).

If it comes back, I’ll try the steps you listed.
