Discourse Crash due to PSQL connection issue

We keep getting this message on our forum, roughly every 3-4 hours. We have 16 CPU cores and 32GB of RAM, so I don’t think resources are the issue.

Oops
The software powering this discussion forum encountered an unexpected problem. We apologize for the inconvenience.

Detailed information about the error was logged, and an automatic notification generated. We'll take a look at it.

No further action is necessary. However, if the error condition persists, you can provide additional detail, including steps to reproduce the error, by posting a discussion topic in the site's feedback category.

The production log shows:

app/models/user_auth_token.rb:125:in `lookup'
lib/auth/default_current_user_provider.rb:131:in `current_user'
lib/current_user.rb:35:in `current_user'
app/controllers/application_controller.rb:1047:in `rate_limit_crawlers'
lib/middleware/omniauth_bypass_middleware.rb:64:in `call'
lib/content_security_policy/middleware.rb:12:in `call'
lib/middleware/anonymous_cache.rb:393:in `call'
lib/middleware/csp_script_nonce_injector.rb:12:in `call'
config/initializers/008-rack-cors.rb:14:in `call'
config/initializers/100-silence_logger.rb:27:in `call'
lib/middleware/enforce_hostname.rb:24:in `call'
lib/middleware/request_tracker.rb:236:in `call'
Unexpected error in Message Bus : ActiveRecord::ConnectionNotEstablished : connection to server at "172.17.0.2", port 5432 failed: FATAL:  remaining connection slots are reserved for non-replication superuser connections

Unexpected error in Message Bus : ActiveRecord::ConnectionNotEstablished : connection to server at "172.17.0.2", port 5432 failed: FATAL:  remaining connection slots are reserved for non-replication superuser connections

Unexpected error in Message Bus : ActiveRecord::ConnectionNotEstablished : connection to server at "172.17.0.2", port 5432 failed: FATAL:  remaining connection slots are reserved for non-replication superuser connections

Unexpected error in Message Bus : ActiveRecord::ConnectionNotEstablished : connection to server at "172.17.0.2", port 5432 failed: FATAL:  remaining connection slots are reserved for non-replication superuser connections

We set the following config in

UNICORN_WORKERS: 32
UNICORN_SIDEKIQS: 2

and for psql

db_shared_buffers: "4096MB"

Please let me know what else I can do to improve the config and make sure the server won’t crash.

I’d give postgres (db_shared_buffers) at least 16GB if not 20.

But you need to allow more DB connections. I can’t remember exactly how to do that.

I think it’s max_connections in /etc/postgresql/postgresql.conf (inside the container that’s running postgres) that you need to change.
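If it helps, here is a rough Python sketch for reading what a postgresql.conf currently sets before changing anything. The sample text and the read_setting helper are made up for illustration, and the path above is only my guess, so point it at whatever file your container really uses. It ignores commented-out lines and, like Postgres, lets the last assignment win. It does not handle a # inside a quoted value.

```python
import re

def read_setting(conf_text, name):
    """Return the last uncommented value assigned to `name`, or None."""
    value = None
    for line in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace (simplification:
        # a '#' inside a quoted value would be cut off too).
        line = line.split("#", 1)[0].strip()
        # postgresql.conf allows "name = value" and "name value".
        match = re.fullmatch(rf"{re.escape(name)}(?:\s*=\s*|\s+)(\S+)", line)
        if match:
            value = match.group(1).strip("'")
    return value

# Hypothetical file contents, not taken from the poster's server.
sample = """
#max_connections = 50
max_connections = 100   # (change requires restart)
shared_buffers = '4096MB'
"""

print(read_setting(sample, "max_connections"))
print(read_setting(sample, "shared_buffers"))
```

To use it on a real file, pass in the file's contents, for example open(path).read(), with path being wherever your config actually lives.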

I think so too. :+1:

Are you sure about that? That seems high. See PostgreSQL: Documentation: 13: 19.4. Resource Consumption

If you have a dedicated database server with 1GB or more of RAM, a reasonable starting value for shared_buffers is 25% of the memory in your system. There are some workloads where even larger settings for shared_buffers are effective, but because PostgreSQL also relies on the operating system cache, it is unlikely that an allocation of more than 40% of RAM to shared_buffers will work better than a smaller amount

And this is not a dedicated database server; there are 32 unicorn processes on it as well.
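To put numbers on the quoted guidance for this 32GB box, here is a small Python sketch. The function name is mine, and the 25% and 40% figures come straight from the docs passage above. Since that passage is about a dedicated database server, I would treat the results as upper bounds rather than recommendations for a machine that also runs unicorn.

```python
def shared_buffers_range_mb(total_ram_gb, start_fraction=0.25, ceiling_fraction=0.40):
    """Return (starting value, practical ceiling) for shared_buffers in MB."""
    total_mb = total_ram_gb * 1024
    return int(total_mb * start_fraction), int(total_mb * ceiling_fraction)

start_mb, ceiling_mb = shared_buffers_range_mb(32)
print(f"25% of RAM: {start_mb}MB, 40% of RAM: {ceiling_mb}MB")

# Share of total RAM for the values mentioned in this thread.
for proposed_gb in (4, 16, 20):
    print(f"{proposed_gb}GB is {proposed_gb / 32:.1%} of 32GB")
```

By this arithmetic the 16GB and 20GB suggestions both sit above the 40% mark, while the current 4096MB is 12.5% of RAM.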

I always defer to you on matters like this, and I thought I was quoting advice you’d given in the past, so, NO I am not sure. :rofl:

It’s pretty clear that the number of connections is the issue. Increasing shared buffers to 25% of 32GB might help in general, but it isn’t the cause of the error.

EDIT:

Ha! That’s exactly what I remembered, except it looks like I was going to go over 50%. . .

I plead sort-of-guilty

But that was then … :wink:

:100:

Why? Those are the cause of you running out of connection slots.
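A back-of-the-envelope sketch in Python of why the process count matters. It assumes stock PostgreSQL defaults (max_connections = 100, superuser_reserved_connections = 3), which may not match what your container is configured with. It does not guess how many connections each Discourse process really opens; it only shows how few each one can hold before the "remaining connection slots are reserved" error appears.

```python
def usable_slots(max_connections, superuser_reserved):
    """Slots that ordinary (non-superuser) roles can actually use."""
    return max_connections - superuser_reserved

def slots_per_process(max_connections, superuser_reserved, process_count):
    """Average number of connections each process can hold before slots run out."""
    return usable_slots(max_connections, superuser_reserved) / process_count

# From the thread: UNICORN_WORKERS: 32 and UNICORN_SIDEKIQS: 2.
processes = 32 + 2

# Assumed stock PostgreSQL defaults; check your own server's values.
print(usable_slots(100, 3))
print(round(slots_per_process(100, 3, processes), 2))
```

Under those assumptions there are 97 usable slots shared by 34 processes, so fewer than three connections per process on average. Anything that briefly opens more than that will tip it over.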

I believe I saw somewhere that UNICORN_WORKERS should be 2 * CPU cores, so I did the math and got 32. Should I scale down?

We tried to change the timeout setting for PostgreSQL by running ALTER ROLE discourse SET statement_timeout = '30000'; and this query is the one that gets blocked, once every couple of hours.

Not sure if you or anybody else has any idea on what happened?

No, please remove this and let the default values take effect. This is a classic case of premature optimization.

Yes, you touched stuff you shouldn’t touch :wink:
