How to avoid upstream timeouts?

My Discourse is doing great, but I’ve been checking my logs and sometimes the connection times out.

I know this is not a Discourse problem per se, but more of a Docker/Nginx one; still, it might be useful for newbies (like me) to know how to tune Nginx, Docker, or Discourse to avoid it. I already removed the default “fail_timeout”, though I’m not sure it did anything.
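
For anyone else hitting this, it helps to first work out which nginx is timing out: a host-level nginx proxying to the container, or the one bundled inside it. A rough way to check, assuming a standard /var/discourse Docker install (adjust paths for your setup):

# Host-level nginx, if you run your own in front of the container:
grep -rn "timeout" /etc/nginx/

# The nginx bundled inside the Discourse container:
cd /var/discourse
./launcher enter app
grep -n "timeout" /etc/nginx/conf.d/discourse.conf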

Any help is much appreciated :slight_smile:

Here’s my log:

2015/11/09 19:32:15 [error] 3094#0: *61192 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 45.30.160.151, server: nomadforum.io, request: "POST /message-bus/3be12904dd624ddc83b6adba28cca586/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/3be12904dd624ddc83b6adba28cca586/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/introduce-yourself-who-are-you-where-are-you-and-what-do-you-do/229/2" 
2015/11/09 19:32:16 [error] 3094#0: *71793 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 182.150.160.50, server: nomadforum.io, request: "POST /message-bus/5299d452871c45d198a935c65929d9e0/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/5299d452871c45d198a935c65929d9e0/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/whats-a-great-place-for-remote-work-and-self-realization-in-se-asia/3108" 
2015/11/09 19:32:16 [error] 3095#0: *69594 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 61.90.15.118, server: nomadforum.io, request: "POST /message-bus/c292dc35ed0b4096a52236274383618e/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/c292dc35ed0b4096a52236274383618e/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/how-do-you-find-love-as-a-digital-nomad/5032/2" 
2015/11/09 19:32:27 [error] 3098#0: *59372 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 125.255.46.197, server: nomadforum.io, request: "POST /message-bus/b21f969688714741b890cafd30d59fde/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/b21f969688714741b890cafd30d59fde/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/if-you-dont-have-a-residence-where-do-you-pay-taxes-as-a-european-citizen/1746/22" 
2015/11/09 19:32:28 [error] 3095#0: *69594 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 61.90.15.118, server: nomadforum.io, request: "POST /message-bus/c292dc35ed0b4096a52236274383618e/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/c292dc35ed0b4096a52236274383618e/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/how-do-you-find-love-as-a-digital-nomad/5032/2" 
2015/11/09 19:32:41 [error] 3094#0: *71793 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 182.150.160.50, server: nomadforum.io, request: "POST /message-bus/15040174a37442a9a5fb0095d8f64f2d/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/15040174a37442a9a5fb0095d8f64f2d/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/where-should-i-register-my-company-when-operating-in-multiple-countries-in-se-asia/1162"
2015/11/09 19:32:43 [error] 3095#0: *69594 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 61.90.15.118, server: nomadforum.io, request: "POST /message-bus/c292dc35ed0b4096a52236274383618e/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/c292dc35ed0b4096a52236274383618e/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/how-do-you-find-love-as-a-digital-nomad/5032/2"  
2015/11/09 19:32:44 [error] 3094#0: *74158 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 202.151.194.72, server: nomadforum.io, request: "POST /message-bus/0b5419c26bd6485aab0e3af9b7008572/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/0b5419c26bd6485aab0e3af9b7008572/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/where-can-i-find-public-places-with-good-internet-to-work-in-kuala-lumpur/1302/6" 
2015/11/09 19:32:49 [error] 3099#0: *75627 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 12.27.99.35, server: nomadforum.io, request: "POST /message-bus/4e0ae92b50554232be0432c7dc421710/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/4e0ae92b50554232be0432c7dc421710/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/whats-the-best-mobile-data-option-in-thailand-for-nomads/1193" 
2015/11/09 19:32:58 [error] 3098#0: *75723 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 184.163.246.172, server: nomadforum.io, request: "POST /message-bus/679f981ab5f0459584b5bbd22416117c/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/679f981ab5f0459584b5bbd22416117c/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/best-place-to-stay-in-bangkok/1564" 
2015/11/09 19:33:01 [error] 3095#0: *69594 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 61.90.15.118, server: nomadforum.io, request: "POST /message-bus/c292dc35ed0b4096a52236274383618e/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/c292dc35ed0b4096a52236274383618e/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/how-do-you-find-love-as-a-digital-nomad/5032/2" 
2015/11/09 19:33:03 [error] 3094#0: *71793 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 182.150.160.50, server: nomadforum.io, request: "POST /message-bus/69bda49b64294b1aba909d50dd9df48b/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/69bda49b64294b1aba909d50dd9df48b/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/what-are-good-neighborhoods-in-chiang-mai/4791" 
2015/11/09 19:33:04 [error] 3099#0: *75627 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 12.27.99.35, server: nomadforum.io, request: "POST /message-bus/4e0ae92b50554232be0432c7dc421710/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/4e0ae92b50554232be0432c7dc421710/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/whats-the-best-mobile-data-option-in-thailand-for-nomads/1193" 
2015/11/09 19:33:13 [error] 3098#0: *75723 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 184.163.246.172, server: nomadforum.io, request: "POST /message-bus/679f981ab5f0459584b5bbd22416117c/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/679f981ab5f0459584b5bbd22416117c/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/best-place-to-stay-in-bangkok/1564" 
2015/11/09 19:33:15 [error] 3094#0: *61192 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 45.30.160.151, server: nomadforum.io, request: "POST /message-bus/3be12904dd624ddc83b6adba28cca586/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/3be12904dd624ddc83b6adba28cca586/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/introduce-yourself-who-are-you-where-are-you-and-what-do-you-do/229/2"

That’s an indication that you don’t have enough Unicorn workers for the number of requests your site is seeing. Consider increasing the UNICORN_WORKERS setting in your app.yml and rebuilding (you might need to get more RAM to accommodate the larger number of workers).
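
Concretely, that’s just an edit to the container definition followed by a rebuild; a rough sketch, assuming the standard /var/discourse install:

cd /var/discourse
nano containers/app.yml      # uncomment UNICORN_WORKERS under env: and raise it
./launcher rebuild app       # the site is briefly offline while the container rebuilds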

2 Likes

Thanks! I have 8 GB of RAM; how many workers do you suggest? I do have a few other sites running on the same server. Right now it’s set to the default, which is commented out:

env:
  LANG: en_US.UTF-8
  ## TODO: How many concurrent web requests are supported?
  ## With 2GB we recommend 3-4 workers, with 1GB only 2
  #UNICORN_WORKERS: 3
  ##  
  ## TODO: List of comma delimited emails that will be made admin and developer
  ## on initial signup example 'user1@example.com,user2@example.com'
  DISCOURSE_DEVELOPER_EMAILS: 'hello@nomadlist.io'
  ##
  ## TODO: The domain name this Discourse instance will respond to
  DISCOURSE_HOSTNAME: 'nomadforum.io'
  ##

Should I uncomment it and set it to 12 workers?

General advice is one worker per CPU core (real or logical). I would allow at least 512 MB of RAM per worker as well, so here’s hoping you have at least 512 MB times the number of logical cores on your server :wink:
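
If it helps, here is a rough back-of-the-envelope check you can run on the box; the 50% of RAM reserved for Postgres, Redis, Sidekiq and everything else is just an assumption of this sketch, not a hard rule:

#!/bin/bash
# Suggest a worker count: one Unicorn per logical core, capped by ~512 MB of RAM each.
cores=$(nproc)
total_mb=$(free -m | awk '/^Mem:/ {print $2}')

# Assume roughly half the RAM stays free for Postgres, Redis, Sidekiq, etc.
by_ram=$(( (total_mb / 2) / 512 ))

workers=$(( cores < by_ram ? cores : by_ram ))
echo "Logical cores: $cores"
echo "Workers that ~512 MB each would allow: $by_ram"
echo "Suggested UNICORN_WORKERS: $workers"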

2 Likes

Thanks! Just did it. Hope it works :smiley:

OK, I just set it to 6 workers, as I have a Linode 8096 (8 GB RAM, 6 CPU cores, 192 GB SSD storage, https://www.linode.com/pricing):

env:
  LANG: en_US.UTF-8
  ## TODO: How many concurrent web requests are supported?
  ## With 2GB we recommend 3-4 workers, with 1GB only 2
  UNICORN_WORKERS: 6
  ##
  ## TODO: List of comma delimited emails that will be made admin and developer
  ## on initial signup example 'user1@example.com,user2@example.com'
  DISCOURSE_DEVELOPER_EMAILS: 'hello@nomadlist.io'
  ##
  ## TODO: The domain name this Discourse instance will respond to
  DISCOURSE_HOSTNAME: 'nomadforum.io'
  ##

Then I rebuilt it: /var/discourse/launcher rebuild app

That took a few minutes, and then the site was up again. But I’m still getting timeouts:

2015/11/11 22:33:24 [error] 11980#0: *19753 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 110.139.184.21, server: nomadforum.io, request: "POST /message-bus/8bd73af468ec44a2ae75ec6534a44696/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/8bd73af468ec44a2ae75ec6534a44696/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/how-to-make-a-macbook-air-ergonomic/5090"
2015/11/11 22:33:27 [error] 11985#0: *23174 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 201.37.160.13, server: nomadforum.io, request: "POST /message-bus/f8bab6397d4b46bca04f91664627c7e9/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/f8bab6397d4b46bca04f91664627c7e9/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/can-you-create-a-foreign-company-and-hire-yourself-as-an-employee/5172"
2015/11/11 22:33:39 [error] 11980#0: *19753 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 110.139.184.21, server: nomadforum.io, request: "POST /message-bus/8bd73af468ec44a2ae75ec6534a44696/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/8bd73af468ec44a2ae75ec6534a44696/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/how-to-make-a-macbook-air-ergonomic/5090"
2015/11/11 22:33:47 [error] 11985#0: *23174 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 201.37.160.13, server: nomadforum.io, request: "POST /message-bus/f8bab6397d4b46bca04f91664627c7e9/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/f8bab6397d4b46bca04f91664627c7e9/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/can-you-create-a-foreign-company-and-hire-yourself-as-an-employee/5172"

Any idea how I can debug this further? Here’s a (filtered) snapshot from my htop. At least I’m seeing 6 workers now, which is good:

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
18587 nobody     20   0  394M 83116 65048 S 11.8  1.0  0:08.59 php-fpm: pool www
 4432 mysql      20   0 3372M 83032     0 S  0.7  1.0  4:08.74 /usr/sbin/mysqld
 4436 mysql      20   0 3372M 83032     0 S  0.0  1.0  3:00.56 /usr/sbin/mysqld
 4224 mysql      20   0 3372M 83032     0 S  0.0  1.0  3h02:41 /usr/sbin/mysqld
11985 www-data   20   0  117M 34704  4608 S  1.3  0.4  0:12.70 nginx: worker process
22228 www-data   20   0  118M  4472  2784 S  0.7  0.1  0:00.44 nginx: worker process
22230 www-data   20   0  119M  4844  2932 S  0.0  0.1  0:00.86 nginx: worker process
22349 pymcrdkib  20   0  466M  215M 10280 S  0.0  2.7  0:14.37 unicorn worker[0] -E production -c config/unicorn.conf.rb
22766 pymcrdkib  20   0  464M  216M 10160 S  0.0  2.7  0:11.96 unicorn worker[1] -E production -c config/unicorn.conf.rb
22899 pymcrdkib  20   0  462M  217M 10032 S  0.7  2.7  0:12.13 unicorn worker[2] -E production -c config/unicorn.conf.rb
23241 pymcrdkib  20   0  462M  216M 10108 S  0.7  2.7  0:12.58 unicorn worker[3] -E production -c config/unicorn.conf.rb
23603 pymcrdkib  20   0 1020M  228M 15668 S  1.3  2.9  0:13.10 unicorn worker[4] -E production -c config/unicorn.conf.rb
23809 pymcrdkib  20   0  414M  171M  9752 S  0.7  2.1  0:11.15 unicorn worker[5] -E production -c config/unicorn.conf.rb
22327 pymcrdkib  20   0 1294M  272M 15704 S  0.0  3.4  0:41.49 sidekiq 3.5.0 discourse [0 of 5 busy]
22308 pymcrdkib  20   0 1294M  272M 15704 S  0.0  3.4  5:16.87 sidekiq 3.5.0 discourse [0 of 5 busy]
22238 messagebu  20   0  377M 10768  9384 S  0.0  0.1  0:00.43 postgres: wal writer process
 3382 root       20   0 67900     8     4 S  0.0  0.0  0:00.00 su couchdb -c /usr/bin/couchdb
22210 mysql      20   0  173M  139M  2692 S  0.0  1.7  0:18.25 /usr/bin/redis-server *:6379
16603 root       20   0  134M 18948  4196 S  0.0  0.2  5:33.98 /usr/bin/irssi
18570 nobody     20   0  437M 75224 16592 S  0.0  0.9  0:11.23 php-fpm: pool www
18718 nobody     20   0  429M 96024 44976 S  0.0  1.2  0:03.55 php-fpm: pool www

Load average: 0.77 1.11 1.49

Seems OK too. Can I optimize something else?

Make sure you have upped the database RAM settings in app.yml. Other than that and adjusting the number of Unicorn workers, that’s about all I can think of. I’m not sure message-bus polling timeouts really matter, though.
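
For reference, the database knobs live in the params section of app.yml; the setting names below match the standard container template, but the example values are only illustrative for an ~8 GB host, not official recommendations:

# Standard /var/discourse install assumed.
cd /var/discourse
grep -E "db_shared_buffers|db_work_mem" containers/app.yml

# Illustrative values for an ~8 GB host; after editing them in
# containers/app.yml, run ./launcher rebuild app:
#   db_shared_buffers: "2GB"
#   db_work_mem: "40MB"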

The message bus is used to poll for updates (e.g. new posts and topics) and refresh the page, right?

Yes, but a “failure” there would never be visible to the user, and it may not matter. Since the bus is always active, the messages will simply arrive on a later poll…

Based on your feedback, I added a line to the sample config to clarify that Unicorn workers should scale with CPU cores when possible:

## TODO: How many concurrent web requests are supported?
## With 2GB we recommend 3-4 workers, with 1GB only 2
## If you have lots of memory, use one or two workers per logical CPU core
#UNICORN_WORKERS: 3
3 Likes

(This is an old topic, but it still seems like the right place for this.)

So the “RAM per worker” is db_work_mem? That seems plausible, but the comments there don’t quite say what you said above.

  ## Set higher on large instances it defaults to 10MB, for a 3GB install 40MB is a good default
  ## this improves sorting performance, but adds memory usage per-connection
  db_work_mem: "512MB"
1 Like

work_mem is used several times per query.

So a complex query (like suggested topics) can use 5-10 times that amount, and that’s for each query running concurrently, so you’re quickly into swapping territory.

The memory allowance per worker that Jeff cited is more about not setting 10 Unicorn workers when you have a 10-CPU server with only 1 GB of RAM.
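
To make the swapping risk concrete, here is a hedged worst case; the 5-10x multiplier is from the post above, and the connection count is just an assumption for the sketch:

#!/bin/bash
# Worst-case work_mem usage: each active query can allocate work_mem several
# times over (sorts, hashes), so multiply it out before raising the setting.
work_mem_mb=512          # the value quoted above
sorts_per_query=10       # upper end of the 5-10x figure
active_connections=20    # assumption: unicorn + sidekiq database connections

echo "Worst case: $(( work_mem_mb * sorts_per_query * active_connections )) MB"
# 512 * 10 * 20 = 102400 MB (~100 GB), which is why 512MB is far too high.
# At 40MB the same worst case is 8000 MB (~8 GB), still worth keeping an eye on.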

4 Likes

You’re back!

OK, I get that I’m wrong. Is this more on target?

So with 8 CPUs and 12 GB:

  db_work_mem: 40MB
  UNICORN_WORKERS: 8

And with 4 CPUs and 8 GB:

  db_work_mem: 40MB
  UNICORN_WORKERS: 4

And then I think the size of the database itself is also involved, since we want it to fit in the buffer cache. With a huge database (say postgres_data takes 8 GB of disk and there’s 8 GB of RAM), you might want to reduce the memory given to the Unicorn workers?
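
For what it’s worth, here is roughly how I’d compare the two; the path assumes the standard Docker install, so adjust for your setup:

# Size of the Postgres data directory on disk:
sudo du -sh /var/discourse/shared/standalone/postgres_data

# Total RAM and how much of it is currently used as cache:
free -h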

2 Likes

Packing my stuff, but I took a minute to check Meta :blush:

Sounds good!

At this size, I would split machines:

  • DB Machine

  • Web Machine(s)

Then you can check timings (the awesome mini_profiler info should be enough) and scale machines independently.
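
A minimal sketch of that split with the standard discourse_docker tooling; the sample file names below come from the repo’s samples directory, and the host settings are placeholders you fill in yourself:

# On the database machine: run only PostgreSQL and Redis.
cd /var/discourse
cp samples/data.yml containers/data.yml
./launcher rebuild data

# On each web machine: run only the web container, pointed at the data machine
# via DISCOURSE_DB_HOST / DISCOURSE_REDIS_HOST in the copied file.
cp samples/web_only.yml containers/web_only.yml
./launcher rebuild web_only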

6 Likes

db_work_mem: 40MB
UNICORN_WORKERS: 8
and with 4 CPUs and 8GB
db_work_mem: 40MB
UNICORN_WORKERS: 4

Hi @Falco,

If we extrapolate those values to Puma, what number of threads per worker would you recommend? We’re talking about a site with 1M requests per day.

Cheers,
Ismael

Is this a site that you are looking at moving to Discourse? For 1M requests/day you may need to move to a more complex setup with HAProxy or Traefik reverse proxying to multiple backends (which is not supported here).

How big is your database?

You can do multiple backends with nginx as well, and it is well within spec.
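
To spell out what that means, the front proxy just needs an upstream block listing the web backends; the addresses and file path here are placeholders:

# On the front proxy (outside the containers); placeholder addresses.
sudo tee /etc/nginx/conf.d/discourse-upstream.conf >/dev/null <<'EOF'
upstream discourse_web {
  server 10.0.0.11:80;   # web backend on machine 1
  server 10.0.0.12:80;   # web backend on machine 2
}
EOF
# The server block then points at it with: proxy_pass http://discourse_web;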

Just serve static assets via a CDN.

1 Like

But only if you already understand that sentence. :wink:

1 Like

Hi @pfaffman,

Many thanks for your reply. Let’s start from the beginning. Ours is an unsupported installation, since we are using OpenShift + HAProxy to deploy the forums. In fact, that specific forum is already running with a configuration like the following:

  • 4 Puma workers with 128 threads per worker.
  • Nginx in front.
  • CephFS network storage for both uploads and backups.
  • Since it’s deployed on OpenShift, we share memory/CPU with other projects, but in general we are talking about servers with 10 cores and ~30 GB of RAM.

Both the Puma and Nginx configurations follow your upstream code, with some small changes.

As for the database, we are talking about ~1.2 GB.

In general it runs very smoothly, but I would like to fine-tune the Puma configuration further. That is, I would like to know under what circumstances it works better with more workers (and fewer threads) versus more threads per worker, to find the right balance.
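
For concreteness, the two shapes I’d be benchmarking look roughly like this; the flags follow Puma’s CLI, but the worker/thread counts are only placeholders to test, and -C points at whichever config file you already use:

# Fewer workers, more threads per worker:
bundle exec puma -e production -w 4 -t 32:32 -C config/puma.rb

# More workers, fewer threads per worker:
bundle exec puma -e production -w 10 -t 8:8 -C config/puma.rb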

Cheers,
Ismael

1 Like

Well, that’s what it means.

We run some instances that autoscale to hundreds of Unicorns, but we don’t really use Puma in production.

4 Likes