'768 worker_connections are not enough' Error

Hey!

Since rebuilding today we’re experiencing a high number of server errors. It seems to be an nginx connection issue; in nginx/error.log I sometimes see bursts of “worker_connections are not enough” alerts like this one:

2021/06/02 10:42:21 [alert] 1143#1143: *28468 1768 worker_connections are not enough while connecting to upstream, client: (IP removed), server: _, request: "POST /message-bus/8fc08436f86f47479cf0dad3deb5c4dc/poll?dlp=t HTTP/1.1", upstream: "http://127.0.0.1:3000/message-bus/8fc08436f86f47479cf0dad3deb5c4dc/poll?dlp=t", host: "blenderartists.org", referrer: "https://blenderartists.org/t/convert-multiple-objects-to-single-mesh-with-vertex-grouping/489173/2"

Any ideas how we can remedy this? We have plenty of CPU/memory available - could we increase the number of ‘worker connections’?
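To see how often these bursts happen and which limit nginx reports each time, the alerts can be tallied per minute. This is a rough sketch rather than an official tool: the regular expression is modelled only on the alert format quoted above, the second and third sample lines are made up for illustration, and the `203.0.113.x` addresses are placeholders.

```python
import re
from collections import Counter

# Modelled on alerts such as:
# 2021/06/02 10:42:21 [alert] 1143#1143: *28468 1768 worker_connections are not enough ...
ALERT_RE = re.compile(
    r"^(?P<minute>\d{4}/\d{2}/\d{2} \d{2}:\d{2}):\d{2} \[alert\] "
    r"\d+#\d+: (?:\*\d+ )?(?P<limit>\d+) worker_connections are not enough"
)


def count_alerts(lines):
    """Count exhaustion alerts per (minute, reported per-worker limit)."""
    counts = Counter()
    for line in lines:
        match = ALERT_RE.match(line)
        if match:
            counts[(match.group("minute"), int(match.group("limit")))] += 1
    return counts


# Illustrative input: the first line is trimmed from the alert above,
# the other two are invented to exercise the parser.
sample = [
    "2021/06/02 10:42:21 [alert] 1143#1143: *28468 1768 worker_connections "
    "are not enough while connecting to upstream, client: 203.0.113.7, server: _",
    "2021/06/02 10:42:48 [alert] 1143#1143: *28470 1768 worker_connections "
    "are not enough while connecting to upstream, client: 203.0.113.8, server: _",
    "2021/06/02 10:43:02 [error] 1143#1143: *28471 some unrelated message",
]

for (minute, limit), n in sorted(count_alerts(sample).items()):
    print(f"{minute}  limit={limit}  alerts={n}")
```

For a real log, pass an open file handle instead of `sample`, for example `count_alerts(open(path))`, where `path` points at wherever your error.log lives.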


Update: I have increased worker_connections for the time being, but I still get these errors (less frequently, and now reporting the higher limit). I’m really curious whether anything changed recently that might cause this, or how I could track this down better.

## Any custom commands to run after building
run:
  - exec: echo "Beginning of custom commands"

  - replace:
      filename: "/etc/nginx/letsencrypt.conf"
      from: "worker_connections 768" 
      to: "worker_connections 1768"

Interesting that this happened after a rebuild, have you recently performed any bulk actions? I’d check the Sidekiq logs and see if there are a large number of jobs there as well.


I did have some bulk actions recently as we switched to the Thumbnail Preview TC, but there’s nothing in my sidekiq queue, I can definitely rule that out.


We bumped the nginx version two days ago, so let’s keep an eye on it. Do you get over 500 concurrent visitors on your site?

Also your whole site is behind Cloudflare so stuff may be different because of it.


I have no idea - we might? Any ideas how I can check that?

Correct. But I have disabled any acceleration and am basically only using it to cache images and avatars. It’s never been an issue until today…

Haha, usually people use Google Analytics or something like that to get that kind of info. The Discourse dashboard has daily pageviews and user visits that can be used to approximate it too.

Not true, your whole site is served via Cloudflare:

curl -I https://blenderartists.org/
HTTP/2 200 
cf-cache-status: DYNAMIC
cf-request-id: 0a6ef945b3000002fe272b2000000001
server: cloudflare
cf-ray: 6591c4b5ec5902fe-MIA
alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400, h3=":443"; ma=86400

But that may be completely unrelated, as your nginx is complaining about upstream connections rather than downstream ones, which means it’s running out of connections between nginx ⟷ unicorn.

Since we keep an open connection for each visitor due to message_bus (live updates service), this is kinda expected if your site is somewhat popular.

Bumping the worker_processes and worker_connections is safe and sounds like it makes sense in your case. We default worker_processes to your number of CPU cores. How many CPU cores do you have?


True :slight_smile: We dropped that a long time ago… We have about 250k pageviews/day (including bots), so 500 doesn’t seem too unusual. The user visits stat only tracks logged-in visits, right?

Right - we do have to pass our requests through CF but we don’t let them touch our javascript etc.

We have 12 cores, 64GB. Typical load is about 2, and we use 50% of our RAM.


Damn that is so weird!

The formula for connections is worker_processes * worker_connections which should be 12 * 768, which would be (click clack) 9216. But your logs say 1768…
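The arithmetic can be sketched as below. Two things are worth keeping in mind: `worker_connections` is a per-worker limit and the alert prints that per-worker number, and the limit counts upstream connections as well as client ones. The "divide by two" estimate is an assumption that every client request is proxied to the app, not an exact rule, and the function names are made up for this example.

```python
def max_connections(worker_processes: int, worker_connections: int) -> int:
    """Upper bound on simultaneous connections across all nginx workers."""
    return worker_processes * worker_connections


def proxied_client_estimate(worker_processes: int, worker_connections: int) -> int:
    """Rough client ceiling when every request is proxied upstream.

    Assumption: each proxied request holds one client-side and one
    upstream-side connection, so about half the total serves clients.
    """
    return max_connections(worker_processes, worker_connections) // 2


# (worker_processes, worker_connections) pairs mentioned in this thread;
# pairing 10 processes with 2000 follows the suggested config.
for processes, connections in [(12, 768), (12, 1768), (10, 2000)]:
    total = max_connections(processes, connections)
    clients = proxied_client_estimate(processes, connections)
    print(f"{processes} x {connections} = {total} total, ~{clients} proxied clients")
```

Since a single worker can hit its own limit while others still have room, the alert can appear well before the combined total is reached.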

Try this in your app.yml:

## Any custom commands to run after building
run:
  - exec: echo "Beginning of custom commands"

  - replace:
      filename: "/etc/nginx/nginx.conf"
      from: "worker_connections 768" 
      to: "worker_connections 2000"
  - replace:
      filename: "/etc/nginx/nginx.conf"
      from: "worker_processes auto" 
      to: "worker_processes 10"

Be aware that the block in your post 2 is acting on the wrong file!
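A `replace` step can only help if its `from` string is actually present in the target file. Here is a minimal sketch of a pre-check you could run against the file contents. The helper `apply_replace` is hypothetical, and the config texts are placeholders, not the real files; in particular, it is only an assumption that letsencrypt.conf has no `worker_connections 768` line.

```python
def apply_replace(text: str, old: str, new: str) -> tuple[str, bool]:
    """Return the edited text and whether `old` was found at all."""
    if old not in text:
        return text, False
    return text.replace(old, new), True


# Placeholder contents standing in for the real config files.
configs = {
    "/etc/nginx/nginx.conf": "events {\n    worker_connections 768;\n}\n",
    "/etc/nginx/letsencrypt.conf": "# placeholder: no events block in this file\n",
}

for path, text in configs.items():
    _, matched = apply_replace(
        text, "worker_connections 768", "worker_connections 2000"
    )
    status = "replaced" if matched else "no match, file left unchanged"
    print(f"{path}: {status}")
```

Running `grep worker_connections` on the files inside the container gives the same answer without any code.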


:facepalm: I pasted the wrong code - I tried the letsencrypt template first, but ended up changing the nginx.conf to 1768 worker connections.

I’ll give your values a try - I’ll be back to report how it goes.


Still getting them, I’m afraid:

2021/06/02 17:40:03 [alert] 2102#2102: *262491 2000 worker_connections are not enough while connecting to upstream, client: <ip removed>, server: _, request: "POST /message-bus/0e453fae0c604c29a876e6ede05b7341/poll?dlp=t HTTP/1.1", upstream: "http://127.0.0.1:3000/message-bus/0e453fae0c604c29a876e6ede05b7341/poll?dlp=t", host: "blenderartists.org", referrer: "https://blenderartists.org/t/weight-paint-not-painting/551282"

I have bumped worker_connections to 4000 and it’s looking good so far :crossed_fingers:


We made it easier to override now:


Cool! So we’d do something like

params:
  nginx_worker_connections: 4000

In app.yml/web_only.yml?


Exactly. We also bumped the default to 4k in the same patch, so admins may want to carefully evaluate if they still need to bump it.


On one site I was also bumping worker processes to 2X CPUs. Should I remove that too?

