How to avoid upstream timeouts?

My Discourse is doing great, but I’ve been checking my logs and sometimes the connection times out.

I know this is not a Discourse problem per se, but more of a Docker/Nginx one; still, it might be useful for newbies (like me) to know how to tune Nginx, Docker, or Discourse to avoid it. I already removed the default “fail_timeout”, though I’m not sure it did anything.
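
For anyone else hitting this, it helps to first work out which nginx is timing out: a host-level nginx proxying to the container, or the one bundled inside it. A rough way to check, assuming a standard /var/discourse Docker install (adjust paths for your setup):

# Host-level nginx, if you run your own in front of the container:
grep -rn "timeout" /etc/nginx/

# The nginx bundled inside the Discourse container:
cd /var/discourse
./launcher enter app
grep -n "timeout" /etc/nginx/conf.d/discourse.conf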

Any help is much appreciated :slight_smile:

Here’s my log:

2015/11/09 19:32:15 [error] 3094#0: *61192 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 45.30.160.151, server: nomadforum.io, request: "POST /message-bus/3be12904dd624ddc83b6adba28cca586/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/3be12904dd624ddc83b6adba28cca586/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/introduce-yourself-who-are-you-where-are-you-and-what-do-you-do/229/2" 
2015/11/09 19:32:16 [error] 3094#0: *71793 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 182.150.160.50, server: nomadforum.io, request: "POST /message-bus/5299d452871c45d198a935c65929d9e0/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/5299d452871c45d198a935c65929d9e0/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/whats-a-great-place-for-remote-work-and-self-realization-in-se-asia/3108" 
2015/11/09 19:32:16 [error] 3095#0: *69594 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 61.90.15.118, server: nomadforum.io, request: "POST /message-bus/c292dc35ed0b4096a52236274383618e/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/c292dc35ed0b4096a52236274383618e/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/how-do-you-find-love-as-a-digital-nomad/5032/2" 
2015/11/09 19:32:27 [error] 3098#0: *59372 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 125.255.46.197, server: nomadforum.io, request: "POST /message-bus/b21f969688714741b890cafd30d59fde/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/b21f969688714741b890cafd30d59fde/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/if-you-dont-have-a-residence-where-do-you-pay-taxes-as-a-european-citizen/1746/22" 
2015/11/09 19:32:28 [error] 3095#0: *69594 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 61.90.15.118, server: nomadforum.io, request: "POST /message-bus/c292dc35ed0b4096a52236274383618e/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/c292dc35ed0b4096a52236274383618e/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/how-do-you-find-love-as-a-digital-nomad/5032/2" 
2015/11/09 19:32:41 [error] 3094#0: *71793 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 182.150.160.50, server: nomadforum.io, request: "POST /message-bus/15040174a37442a9a5fb0095d8f64f2d/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/15040174a37442a9a5fb0095d8f64f2d/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/where-should-i-register-my-company-when-operating-in-multiple-countries-in-se-asia/1162"
2015/11/09 19:32:43 [error] 3095#0: *69594 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 61.90.15.118, server: nomadforum.io, request: "POST /message-bus/c292dc35ed0b4096a52236274383618e/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/c292dc35ed0b4096a52236274383618e/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/how-do-you-find-love-as-a-digital-nomad/5032/2"  
2015/11/09 19:32:44 [error] 3094#0: *74158 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 202.151.194.72, server: nomadforum.io, request: "POST /message-bus/0b5419c26bd6485aab0e3af9b7008572/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/0b5419c26bd6485aab0e3af9b7008572/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/where-can-i-find-public-places-with-good-internet-to-work-in-kuala-lumpur/1302/6" 
2015/11/09 19:32:49 [error] 3099#0: *75627 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 12.27.99.35, server: nomadforum.io, request: "POST /message-bus/4e0ae92b50554232be0432c7dc421710/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/4e0ae92b50554232be0432c7dc421710/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/whats-the-best-mobile-data-option-in-thailand-for-nomads/1193" 
2015/11/09 19:32:58 [error] 3098#0: *75723 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 184.163.246.172, server: nomadforum.io, request: "POST /message-bus/679f981ab5f0459584b5bbd22416117c/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/679f981ab5f0459584b5bbd22416117c/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/best-place-to-stay-in-bangkok/1564" 
2015/11/09 19:33:01 [error] 3095#0: *69594 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 61.90.15.118, server: nomadforum.io, request: "POST /message-bus/c292dc35ed0b4096a52236274383618e/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/c292dc35ed0b4096a52236274383618e/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/how-do-you-find-love-as-a-digital-nomad/5032/2" 
2015/11/09 19:33:03 [error] 3094#0: *71793 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 182.150.160.50, server: nomadforum.io, request: "POST /message-bus/69bda49b64294b1aba909d50dd9df48b/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/69bda49b64294b1aba909d50dd9df48b/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/what-are-good-neighborhoods-in-chiang-mai/4791" 
2015/11/09 19:33:04 [error] 3099#0: *75627 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 12.27.99.35, server: nomadforum.io, request: "POST /message-bus/4e0ae92b50554232be0432c7dc421710/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/4e0ae92b50554232be0432c7dc421710/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/whats-the-best-mobile-data-option-in-thailand-for-nomads/1193" 
2015/11/09 19:33:13 [error] 3098#0: *75723 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 184.163.246.172, server: nomadforum.io, request: "POST /message-bus/679f981ab5f0459584b5bbd22416117c/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/679f981ab5f0459584b5bbd22416117c/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/best-place-to-stay-in-bangkok/1564" 
2015/11/09 19:33:15 [error] 3094#0: *61192 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 45.30.160.151, server: nomadforum.io, request: "POST /message-bus/3be12904dd624ddc83b6adba28cca586/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/3be12904dd624ddc83b6adba28cca586/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/introduce-yourself-who-are-you-where-are-you-and-what-do-you-do/229/2"

That’s an indication that you don’t have enough Unicorn workers for the number of requests your site is seeing. Consider increasing the UNICORN_WORKERS setting in your app.yml and rebuilding (you might need to get more RAM to accommodate the larger number of workers).
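
Concretely, that’s just an edit to the container definition followed by a rebuild; a rough sketch, assuming the standard /var/discourse install:

cd /var/discourse
nano containers/app.yml      # uncomment UNICORN_WORKERS under env: and raise it
./launcher rebuild app       # the site is briefly offline while the container rebuilds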

2 Likes

Thanks! I have 8 GB of RAM; how many workers do you suggest? I do have a few other sites running on the same server. Right now it’s set to the default, which is commented out:

env:
  LANG: en_US.UTF-8
  ## TODO: How many concurrent web requests are supported?
  ## With 2GB we recommend 3-4 workers, with 1GB only 2
  #UNICORN_WORKERS: 3
  ##  
  ## TODO: List of comma delimited emails that will be made admin and developer
  ## on initial signup example 'user1@example.com,user2@example.com'
  DISCOURSE_DEVELOPER_EMAILS: 'hello@nomadlist.io'
  ##
  ## TODO: The domain name this Discourse instance will respond to
  DISCOURSE_HOSTNAME: 'nomadforum.io'
  ##

Should I uncomment it and set it to 12 workers?

General advice is one worker per CPU core (real or logical). I would allow at least 512 MB of RAM per worker as well, so here’s hoping you have at least 512 MB times the number of logical cores on your server :wink:
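
If it helps, here is a rough back-of-the-envelope check you can run on the box; the 50% of RAM reserved for Postgres, Redis, Sidekiq and everything else is just an assumption of this sketch, not a hard rule:

#!/bin/bash
# Suggest a worker count: one Unicorn per logical core, capped by ~512 MB of RAM each.
cores=$(nproc)
total_mb=$(free -m | awk '/^Mem:/ {print $2}')

# Assume roughly half the RAM stays free for Postgres, Redis, Sidekiq, etc.
by_ram=$(( (total_mb / 2) / 512 ))

workers=$(( cores < by_ram ? cores : by_ram ))
echo "Logical cores: $cores"
echo "Workers that ~512 MB each would allow: $by_ram"
echo "Suggested UNICORN_WORKERS: $workers"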

2 Likes

Thanks! Just did it. Hope it works :smiley:

OK, I just set it to 6 workers, as I have a Linode 8096 (8 GB RAM, 6 CPU cores, 192 GB SSD storage, https://www.linode.com/pricing):

env:
  LANG: en_US.UTF-8
  ## TODO: How many concurrent web requests are supported?
  ## With 2GB we recommend 3-4 workers, with 1GB only 2
  UNICORN_WORKERS: 6
  ##
  ## TODO: List of comma delimited emails that will be made admin and developer
  ## on initial signup example 'user1@example.com,user2@example.com'
  DISCOURSE_DEVELOPER_EMAILS: 'hello@nomadlist.io'
  ##
  ## TODO: The domain name this Discourse instance will respond to
  DISCOURSE_HOSTNAME: 'nomadforum.io'
  ##

Then I rebuilt it: /var/discourse/launcher rebuild app

That took a few minutes, and then the site was up again. But I’m still getting timeouts:

2015/11/11 22:33:24 [error] 11980#0: *19753 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 110.139.184.21, server: nomadforum.io, request: "POST /message-bus/8bd73af468ec44a2ae75ec6534a44696/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/8bd73af468ec44a2ae75ec6534a44696/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/how-to-make-a-macbook-air-ergonomic/5090"
2015/11/11 22:33:27 [error] 11985#0: *23174 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 201.37.160.13, server: nomadforum.io, request: "POST /message-bus/f8bab6397d4b46bca04f91664627c7e9/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/f8bab6397d4b46bca04f91664627c7e9/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/can-you-create-a-foreign-company-and-hire-yourself-as-an-employee/5172"
2015/11/11 22:33:39 [error] 11980#0: *19753 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 110.139.184.21, server: nomadforum.io, request: "POST /message-bus/8bd73af468ec44a2ae75ec6534a44696/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/8bd73af468ec44a2ae75ec6534a44696/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/how-to-make-a-macbook-air-ergonomic/5090"
2015/11/11 22:33:47 [error] 11985#0: *23174 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 201.37.160.13, server: nomadforum.io, request: "POST /message-bus/f8bab6397d4b46bca04f91664627c7e9/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/f8bab6397d4b46bca04f91664627c7e9/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/can-you-create-a-foreign-company-and-hire-yourself-as-an-employee/5172"

Any idea how I can debug this further? Here’s a (filtered) snapshot from my htop. At least I’m seeing 6 workers now, which is good:

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
18587 nobody     20   0  394M 83116 65048 S 11.8  1.0  0:08.59 php-fpm: pool www
 4432 mysql      20   0 3372M 83032     0 S  0.7  1.0  4:08.74 /usr/sbin/mysqld
 4436 mysql      20   0 3372M 83032     0 S  0.0  1.0  3:00.56 /usr/sbin/mysqld
 4224 mysql      20   0 3372M 83032     0 S  0.0  1.0  3h02:41 /usr/sbin/mysqld
11985 www-data   20   0  117M 34704  4608 S  1.3  0.4  0:12.70 nginx: worker process
22228 www-data   20   0  118M  4472  2784 S  0.7  0.1  0:00.44 nginx: worker process
22230 www-data   20   0  119M  4844  2932 S  0.0  0.1  0:00.86 nginx: worker process
22349 pymcrdkib  20   0  466M  215M 10280 S  0.0  2.7  0:14.37 unicorn worker[0] -E production -c config/unicorn.conf.rb
22766 pymcrdkib  20   0  464M  216M 10160 S  0.0  2.7  0:11.96 unicorn worker[1] -E production -c config/unicorn.conf.rb
22899 pymcrdkib  20   0  462M  217M 10032 S  0.7  2.7  0:12.13 unicorn worker[2] -E production -c config/unicorn.conf.rb
23241 pymcrdkib  20   0  462M  216M 10108 S  0.7  2.7  0:12.58 unicorn worker[3] -E production -c config/unicorn.conf.rb
23603 pymcrdkib  20   0 1020M  228M 15668 S  1.3  2.9  0:13.10 unicorn worker[4] -E production -c config/unicorn.conf.rb
23809 pymcrdkib  20   0  414M  171M  9752 S  0.7  2.1  0:11.15 unicorn worker[5] -E production -c config/unicorn.conf.rb
22327 pymcrdkib  20   0 1294M  272M 15704 S  0.0  3.4  0:41.49 sidekiq 3.5.0 discourse [0 of 5 busy]
22308 pymcrdkib  20   0 1294M  272M 15704 S  0.0  3.4  5:16.87 sidekiq 3.5.0 discourse [0 of 5 busy]
22238 messagebu  20   0  377M 10768  9384 S  0.0  0.1  0:00.43 postgres: wal writer process
 3382 root       20   0 67900     8     4 S  0.0  0.0  0:00.00 su couchdb -c /usr/bin/couchdb
22210 mysql      20   0  173M  139M  2692 S  0.0  1.7  0:18.25 /usr/bin/redis-server *:6379
16603 root       20   0  134M 18948  4196 S  0.0  0.2  5:33.98 /usr/bin/irssi
18570 nobody     20   0  437M 75224 16592 S  0.0  0.9  0:11.23 php-fpm: pool www
18718 nobody     20   0  429M 96024 44976 S  0.0  1.2  0:03.55 php-fpm: pool www

Load average: 0.77 1.11 1.49

Seems OK too. Can I optimize something else?

Make sure you have upped the database RAM settings in app.yml. Other than that and adjusting the number of Unicorn workers, that’s about all I can think of. I’m not sure message-bus polling timeouts really matter, though.
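
For reference, the database knobs live in the params section of app.yml; the setting names below match the standard container template, but the example values are only illustrative for an ~8 GB host, not official recommendations:

# Standard /var/discourse install assumed.
cd /var/discourse
grep -E "db_shared_buffers|db_work_mem" containers/app.yml

# Illustrative values for an ~8 GB host; after editing them in
# containers/app.yml, run ./launcher rebuild app:
#   db_shared_buffers: "2GB"
#   db_work_mem: "40MB"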

The message bus is used to poll for updates (e.g. new posts and topics) and refresh the page, right?

Yes, but a “failure” there would never be visible to the user, and it may not matter. Since the bus is always active, the messages will simply arrive on a later poll…

Based on your feedback, I added a line to the sample config to clarify that Unicorn workers should scale with CPU cores when possible:

## TODO: How many concurrent web requests are supported?
## With 2GB we recommend 3-4 workers, with 1GB only 2
## If you have lots of memory, use one or two workers per logical CPU core
#UNICORN_WORKERS: 3
3 Likes

(This is an old topic, but it still seems like the right place for this.)

So the “RAM per worker” is db_work_mem? That seems plausible, but the comments there don’t quite say what you said above.

  ## Set higher on large instances it defaults to 10MB, for a 3GB install 40MB is a good default
  ## this improves sorting performance, but adds memory usage per-connection
  db_work_mem: "512MB"
1 Like

work_mem is used several times per query.

So a complex query (like suggested topics) can use 5-10 times that amount, and that’s for each query running concurrently, so you’re quickly into swapping territory.

The memory allowance per worker that Jeff cited is more about not setting 10 Unicorn workers when you have a 10-CPU server with only 1 GB of RAM.
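
To make the swapping risk concrete, here is a hedged worst case; the 5-10x multiplier is from the post above, and the connection count is just an assumption for the sketch:

#!/bin/bash
# Worst-case work_mem usage: each active query can allocate work_mem several
# times over (sorts, hashes), so multiply it out before raising the setting.
work_mem_mb=512          # the value quoted above
sorts_per_query=10       # upper end of the 5-10x figure
active_connections=20    # assumption: unicorn + sidekiq database connections

echo "Worst case: $(( work_mem_mb * sorts_per_query * active_connections )) MB"
# 512 * 10 * 20 = 102400 MB (~100 GB), which is why 512MB is far too high.
# At 40MB the same worst case is 8000 MB (~8 GB), still worth keeping an eye on.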

4 Likes

You’re back!

OK, I get that I’m wrong. Is this more on target?

So with 8 CPUs and 12 GB:

  db_work_mem: 40MB
  UNICORN_WORKERS: 8

And with 4 CPUs and 8 GB:

  db_work_mem: 40MB
  UNICORN_WORKERS: 4

And then I think the size of the database itself is also involved, since we want it to fit in the buffer cache. With a huge database (say postgres_data takes 8 GB of disk and there’s 8 GB of RAM), you might want to reduce the memory given to the Unicorn workers?
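
For what it’s worth, here is roughly how I’d compare the two; the path assumes the standard Docker install, so adjust for your setup:

# Size of the Postgres data directory on disk:
sudo du -sh /var/discourse/shared/standalone/postgres_data

# Total RAM and how much of it is currently used as cache:
free -h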

2 Likes

Packing my stuff, but I took a minute to check Meta :blush:

Sounds good!

At this size, I would split machines:

  • DB Machine

  • Web Machine(s)

Then you can check timings (the awesome mini_profiler info should be enough) and scale machines independently.
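
A minimal sketch of that split with the standard discourse_docker tooling; the sample file names below come from the repo’s samples directory, and the host settings are placeholders you fill in yourself:

# On the database machine: run only PostgreSQL and Redis.
cd /var/discourse
cp samples/data.yml containers/data.yml
./launcher rebuild data

# On each web machine: run only the web container, pointed at the data machine
# via DISCOURSE_DB_HOST / DISCOURSE_REDIS_HOST in the copied file.
cp samples/web_only.yml containers/web_only.yml
./launcher rebuild web_only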

6 Likes

db_work_mem: 40MB
UNICORN_WORKERS: 8
and with 4 CPUs and 8GB
db_work_mem: 40MB
UNICORN_WORKERS: 4

Hi @Falco,

If we extrapolate those values to Puma, what number of threads per worker would you recommend? We’re talking about a site with 1M requests per day.

Cheers,
Ismael

Is this a site that you are looking at moving to Discourse? For 1M requests/day you may need to move to a more complex setup with HAProxy or Traefik reverse proxying to multiple backends (which is not supported here).

How big is your database?

You can do multiple backends with nginx as well, and it is well within spec.
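
To spell out what that means, the front proxy just needs an upstream block listing the web backends; the addresses and file path here are placeholders:

# On the front proxy (outside the containers); placeholder addresses.
sudo tee /etc/nginx/conf.d/discourse-upstream.conf >/dev/null <<'EOF'
upstream discourse_web {
  server 10.0.0.11:80;   # web backend on machine 1
  server 10.0.0.12:80;   # web backend on machine 2
}
EOF
# The server block then points at it with: proxy_pass http://discourse_web;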

Just serve static assets via a CDN.

1 Like

But only if you already understand that sentence. :wink:

1 Like

Hi @pfaffman,

Many thanks for your reply. Let’s start from the beginning. Ours is an unsupported installation, since we are using OpenShift + HAProxy to deploy the forums. In fact, that specific forum is already running with a configuration like the following:

  • 4 Puma workers with 128 threads per worker.
  • Nginx in front.
  • CephFS network storage for both uploads and backups.
  • Since it’s deployed on OpenShift, we share memory/CPU with other projects, but in general we are talking about servers with 10 cores and ~30 GB of RAM.

Both the Puma and Nginx configurations follow your upstream code, with some small changes.

As for the database, we are talking about ~1.2 GB.

In general it runs very smoothly, but I would like to fine-tune the Puma configuration further. That is, I would like to know under what circumstances it works better with more workers (and fewer threads) versus more threads per worker, to find the right balance.
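
For concreteness, the two shapes I’d be benchmarking look roughly like this; the flags follow Puma’s CLI, but the worker/thread counts are only placeholders to test, and -C points at whichever config file you already use:

# Fewer workers, more threads per worker:
bundle exec puma -e production -w 4 -t 32:32 -C config/puma.rb

# More workers, fewer threads per worker:
bundle exec puma -e production -w 10 -t 8:8 -C config/puma.rb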

Cheers,
Ismael

1 Like

Well, that’s what it means.

We run some instances that autoscale to hundreds of Unicorns, but we don’t really use Puma in production.

4 Likes