Since rebuilding today we’re experiencing a high number of server errors. It seems to be an nginx connection issue; in nginx/error.log I sometimes see bursts of “768 worker_connections are not enough” messages like this one:
```text
2021/06/02 10:42:21 [alert] 1143#1143: *28468 1768 worker_connections are not enough while connecting to upstream, client: (IP removed), server: _, request: "POST /message-bus/8fc08436f86f47479cf0dad3deb5c4dc/poll?dlp=t HTTP/1.1", upstream: "http://127.0.0.1:3000/message-bus/8fc08436f86f47479cf0dad3deb5c4dc/poll?dlp=t", host: "blenderartists.org", referrer: "https://blenderartists.org/t/convert-multiple-objects-to-single-mesh-with-vertex-grouping/489173/2"
```
Any ideas on how we can remedy this? We have plenty of CPU/memory available - could we increase the number of worker_connections?
Update: I have increased worker_connections for the time being, but I still get these errors (less frequently, and now for the higher connection count). I’m really curious whether anything changed recently that might cause this, or how I could track this down better.
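For now I’m mostly just watching how frequent the bursts are, with something like this (the log path assumes the standard /var/discourse Docker install; adjust if yours differs):

```sh
# Count "worker_connections are not enough" alerts per minute to see when bursts happen.
# Path assumes the standard Discourse Docker install; adjust for your setup.
grep 'worker_connections are not enough' \
  /var/discourse/shared/standalone/log/var-log/nginx/error.log \
  | awk '{print $1, substr($2, 1, 5)}' \
  | uniq -c | sort -rn | head
```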
This is the block I added to app.yml:

```yaml
## Any custom commands to run after building
run:
  - exec: echo "Beginning of custom commands"
  - replace:
      filename: "/etc/nginx/letsencrypt.conf"
      from: "worker_connections 768"
      to: "worker_connections 1768"
```
Interesting that this happened after a rebuild. Have you recently performed any bulk actions? I’d check the Sidekiq logs and see whether there are a large number of jobs there as well.
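If it helps, one quick way to look for a backlog of queued jobs is a Rails console inside the app container. This is just a sketch, assuming the standard /var/discourse install:

```sh
# Check for a backlog of queued Sidekiq jobs (standard /var/discourse install assumed).
cd /var/discourse
./launcher enter app
rails c
# then, at the Rails console prompt:
#   Sidekiq::Stats.new.enqueued
```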
I did have some bulk actions recently as we switched to the Thumbnail Preview theme component, but there’s nothing in my Sidekiq queue, so I can definitely rule that out.
Haha, usually people use Google Analytics or something like that for this kind of info. The Discourse dashboard has daily pageviews and user visits that can be used to approximate it too.
Not true, your whole site is served via Cloudflare.
But that may be completely unrelated, as your nginx is complaining about upstream connections instead of downstream ones, which means it’s running out of connections between nginx ⟷ unicorn.
Since we keep an open connection for each visitor due to message_bus (live updates service), this is kinda expected if your site is somewhat popular.
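If you want a rough sense of how many of those upstream connections are open at any given moment, something like this inside the app container can show it. Just a sketch; it assumes `ss` from iproute2 is available in the container and that unicorn listens on 127.0.0.1:3000 as in your log:

```sh
# Count established connections from nginx to the unicorn upstream on port 3000.
# Run inside the app container (./launcher enter app); assumes `ss` is installed.
ss -nt state established '( dport = :3000 )' | tail -n +2 | wc -l
```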
Bumping the worker_processes and worker_connections is safe and sounds like it makes sense in your case. We default worker_processes to your number of CPU cores. How many CPU cores do you have?
True, we dropped that a long time ago… We have about 250k pageviews/day (including bots), so 500 doesn’t seem too unusual. The user-visits stat only tracks logged-in visits, right?
Right - we do have to pass our requests through CF but we don’t let them touch our javascript etc.
We have 12 cores, 64GB. Typical load is about 2, and we use 50% of our RAM.
The formula for the total number of connections is worker_processes * worker_connections, which in your case should be 12 * 768 = 9216 (click clack). But your logs say 1768…
Try this in your app.yml:
```yaml
## Any custom commands to run after building
run:
  - exec: echo "Beginning of custom commands"
  - replace:
      filename: "/etc/nginx/nginx.conf"
      from: "worker_connections 768"
      to: "worker_connections 2000"
  - replace:
      filename: "/etc/nginx/nginx.conf"
      from: "worker_processes auto"
      to: "worker_processes 10"
```
Be aware that the block in your earlier post is acting on the wrong file (/etc/nginx/letsencrypt.conf instead of /etc/nginx/nginx.conf)!
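After the next rebuild you can double-check that the replacements actually landed in the running config, e.g. with something like this (assuming the standard /var/discourse install; `nginx -T` dumps the effective configuration):

```sh
# Inspect the effective nginx worker settings inside the app container.
cd /var/discourse
./launcher enter app
# then, inside the container:
nginx -T 2>/dev/null | grep -E 'worker_(processes|connections)'
```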