How to avoid upstream timeouts?


#1

My Discourse is doing great, but I’ve been checking my logs and sometimes the connection times out.

I know this is not a Discourse problem per se, and more of a Docker one, but it might be useful for newbies (like me) to know how to tune our Nginx, Docker or Discourse to avoid this. I already removed the default “fail_timeout”. Not sure if it did anything though.

Any help is much appreciated :slight_smile:

Here’s my log:

2015/11/09 19:32:15 [error] 3094#0: *61192 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 45.30.160.151, server: nomadforum.io, request: "POST /message-bus/3be12904dd624ddc83b6adba28cca586/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/3be12904dd624ddc83b6adba28cca586/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/introduce-yourself-who-are-you-where-are-you-and-what-do-you-do/229/2" 
2015/11/09 19:32:16 [error] 3094#0: *71793 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 182.150.160.50, server: nomadforum.io, request: "POST /message-bus/5299d452871c45d198a935c65929d9e0/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/5299d452871c45d198a935c65929d9e0/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/whats-a-great-place-for-remote-work-and-self-realization-in-se-asia/3108" 
2015/11/09 19:32:16 [error] 3095#0: *69594 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 61.90.15.118, server: nomadforum.io, request: "POST /message-bus/c292dc35ed0b4096a52236274383618e/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/c292dc35ed0b4096a52236274383618e/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/how-do-you-find-love-as-a-digital-nomad/5032/2" 
2015/11/09 19:32:27 [error] 3098#0: *59372 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 125.255.46.197, server: nomadforum.io, request: "POST /message-bus/b21f969688714741b890cafd30d59fde/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/b21f969688714741b890cafd30d59fde/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/if-you-dont-have-a-residence-where-do-you-pay-taxes-as-a-european-citizen/1746/22" 
2015/11/09 19:32:28 [error] 3095#0: *69594 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 61.90.15.118, server: nomadforum.io, request: "POST /message-bus/c292dc35ed0b4096a52236274383618e/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/c292dc35ed0b4096a52236274383618e/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/how-do-you-find-love-as-a-digital-nomad/5032/2" 
2015/11/09 19:32:41 [error] 3094#0: *71793 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 182.150.160.50, server: nomadforum.io, request: "POST /message-bus/15040174a37442a9a5fb0095d8f64f2d/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/15040174a37442a9a5fb0095d8f64f2d/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/where-should-i-register-my-company-when-operating-in-multiple-countries-in-se-asia/1162"
2015/11/09 19:32:43 [error] 3095#0: *69594 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 61.90.15.118, server: nomadforum.io, request: "POST /message-bus/c292dc35ed0b4096a52236274383618e/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/c292dc35ed0b4096a52236274383618e/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/how-do-you-find-love-as-a-digital-nomad/5032/2"  
2015/11/09 19:32:44 [error] 3094#0: *74158 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 202.151.194.72, server: nomadforum.io, request: "POST /message-bus/0b5419c26bd6485aab0e3af9b7008572/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/0b5419c26bd6485aab0e3af9b7008572/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/where-can-i-find-public-places-with-good-internet-to-work-in-kuala-lumpur/1302/6" 
2015/11/09 19:32:49 [error] 3099#0: *75627 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 12.27.99.35, server: nomadforum.io, request: "POST /message-bus/4e0ae92b50554232be0432c7dc421710/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/4e0ae92b50554232be0432c7dc421710/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/whats-the-best-mobile-data-option-in-thailand-for-nomads/1193" 
2015/11/09 19:32:58 [error] 3098#0: *75723 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 184.163.246.172, server: nomadforum.io, request: "POST /message-bus/679f981ab5f0459584b5bbd22416117c/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/679f981ab5f0459584b5bbd22416117c/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/best-place-to-stay-in-bangkok/1564" 
2015/11/09 19:33:01 [error] 3095#0: *69594 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 61.90.15.118, server: nomadforum.io, request: "POST /message-bus/c292dc35ed0b4096a52236274383618e/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/c292dc35ed0b4096a52236274383618e/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/how-do-you-find-love-as-a-digital-nomad/5032/2" 
2015/11/09 19:33:03 [error] 3094#0: *71793 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 182.150.160.50, server: nomadforum.io, request: "POST /message-bus/69bda49b64294b1aba909d50dd9df48b/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/69bda49b64294b1aba909d50dd9df48b/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/what-are-good-neighborhoods-in-chiang-mai/4791" 
2015/11/09 19:33:04 [error] 3099#0: *75627 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 12.27.99.35, server: nomadforum.io, request: "POST /message-bus/4e0ae92b50554232be0432c7dc421710/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/4e0ae92b50554232be0432c7dc421710/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/whats-the-best-mobile-data-option-in-thailand-for-nomads/1193" 
2015/11/09 19:33:13 [error] 3098#0: *75723 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 184.163.246.172, server: nomadforum.io, request: "POST /message-bus/679f981ab5f0459584b5bbd22416117c/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/679f981ab5f0459584b5bbd22416117c/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/best-place-to-stay-in-bangkok/1564" 
2015/11/09 19:33:15 [error] 3094#0: *61192 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 45.30.160.151, server: nomadforum.io, request: "POST /message-bus/3be12904dd624ddc83b6adba28cca586/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/3be12904dd624ddc83b6adba28cca586/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/introduce-yourself-who-are-you-where-are-you-and-what-do-you-do/229/2"

(Matt Palmer) #2

That’s an indication that you don’t have enough Unicorn workers for the number of requests your site is seeing. Consider increasing the UNICORN_WORKERS setting in your app.yml and rebuilding (you might need to get more RAM to accommodate the larger number of workers).


#3

Thanks! I have 8GB of RAM, what amount of workers do you suggest? I do have other sites running it on, a few. It’s set now to the default that’s commented out:

env:
  LANG: en_US.UTF-8
  ## TODO: How many concurrent web requests are supported?
  ## With 2GB we recommend 3-4 workers, with 1GB only 2
  #UNICORN_WORKERS: 3
  ##  
  ## TODO: List of comma delimited emails that will be made admin and developer
  ## on initial signup example 'user1@example.com,user2@example.com'
  DISCOURSE_DEVELOPER_EMAILS: 'hello@nomadlist.io'
  ##
  ## TODO: The domain name this Discourse instance will respond to
  DISCOURSE_HOSTNAME: 'nomadforum.io'
  ##

Should I uncomment it and set it to 12 workers?


(Jeff Atwood) #4

General advice is one worker per CPU core (real or logical). I would allow at least 512mb ram per worker as well, so here is hoping you have at least 512mb times the number of logical cores on your server :wink:


#5

Thanks! Just did it. Hope it works :smiley:


#6

Ok I just set it to 6 workers (as I have a Linode 8096: 8 GB RAM, 6 CPU Cores, 192 GB SSD Storage Cloud Server Plans and Pricing - Linode)

env:
  LANG: en_US.UTF-8
  ## TODO: How many concurrent web requests are supported?
  ## With 2GB we recommend 3-4 workers, with 1GB only 2
  UNICORN_WORKERS: 6
  ##
  ## TODO: List of comma delimited emails that will be made admin and developer
  ## on initial signup example 'user1@example.com,user2@example.com'
  DISCOURSE_DEVELOPER_EMAILS: 'hello@nomadlist.io'
  ##
  ## TODO: The domain name this Discourse instance will respond to
  DISCOURSE_HOSTNAME: 'nomadforum.io'
  ##

Then I rebuild it: /var/discourse/launcher rebuild app

Which took a few minutes and then it was up again. But I’m still getting timeouts:

2015/11/11 22:33:24 [error] 11980#0: *19753 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 110.139.184.21, server: nomadforum.io, request: "POST /message-bus/8bd73af468ec44a2ae75ec6534a44696/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/8bd73af468ec44a2ae75ec6534a44696/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/how-to-make-a-macbook-air-ergonomic/5090"
2015/11/11 22:33:27 [error] 11985#0: *23174 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 201.37.160.13, server: nomadforum.io, request: "POST /message-bus/f8bab6397d4b46bca04f91664627c7e9/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/f8bab6397d4b46bca04f91664627c7e9/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/can-you-create-a-foreign-company-and-hire-yourself-as-an-employee/5172"
2015/11/11 22:33:39 [error] 11980#0: *19753 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 110.139.184.21, server: nomadforum.io, request: "POST /message-bus/8bd73af468ec44a2ae75ec6534a44696/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/8bd73af468ec44a2ae75ec6534a44696/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/how-to-make-a-macbook-air-ergonomic/5090"
2015/11/11 22:33:47 [error] 11985#0: *23174 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 201.37.160.13, server: nomadforum.io, request: "POST /message-bus/f8bab6397d4b46bca04f91664627c7e9/poll? HTTP/1.1", upstream: "http://127.0.0.1:4578/message-bus/f8bab6397d4b46bca04f91664627c7e9/poll?", host: "nomadforum.io", referrer: "https://nomadforum.io/t/can-you-create-a-foreign-company-and-hire-yourself-as-an-employee/5172"

Any idea how I can debug this further? Here’s a (filtered) print from my htop. At least I’m seeing 6 workers now, that’s good:

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
18587 nobody     20   0  394M 83116 65048 S 11.8  1.0  0:08.59 php-fpm: pool www
 4432 mysql      20   0 3372M 83032     0 S  0.7  1.0  4:08.74 /usr/sbin/mysqld
 4436 mysql      20   0 3372M 83032     0 S  0.0  1.0  3:00.56 /usr/sbin/mysqld
 4224 mysql      20   0 3372M 83032     0 S  0.0  1.0  3h02:41 /usr/sbin/mysqld
11985 www-data   20   0  117M 34704  4608 S  1.3  0.4  0:12.70 nginx: worker process
22228 www-data   20   0  118M  4472  2784 S  0.7  0.1  0:00.44 nginx: worker process
22230 www-data   20   0  119M  4844  2932 S  0.0  0.1  0:00.86 nginx: worker process
22349 pymcrdkib  20   0  466M  215M 10280 S  0.0  2.7  0:14.37 unicorn worker[0] -E production -c config/unicorn.conf.rb
22766 pymcrdkib  20   0  464M  216M 10160 S  0.0  2.7  0:11.96 unicorn worker[1] -E production -c config/unicorn.conf.rb
22899 pymcrdkib  20   0  462M  217M 10032 S  0.7  2.7  0:12.13 unicorn worker[2] -E production -c config/unicorn.conf.rb
23241 pymcrdkib  20   0  462M  216M 10108 S  0.7  2.7  0:12.58 unicorn worker[3] -E production -c config/unicorn.conf.rb
23603 pymcrdkib  20   0 1020M  228M 15668 S  1.3  2.9  0:13.10 unicorn worker[4] -E production -c config/unicorn.conf.rb
23809 pymcrdkib  20   0  414M  171M  9752 S  0.7  2.1  0:11.15 unicorn worker[5] -E production -c config/unicorn.conf.rb
22327 pymcrdkib  20   0 1294M  272M 15704 S  0.0  3.4  0:41.49 sidekiq 3.5.0 discourse [0 of 5 busy]
22308 pymcrdkib  20   0 1294M  272M 15704 S  0.0  3.4  5:16.87 sidekiq 3.5.0 discourse [0 of 5 busy]
22238 messagebu  20   0  377M 10768  9384 S  0.0  0.1  0:00.43 postgres: wal writer process
 3382 root       20   0 67900     8     4 S  0.0  0.0  0:00.00 su couchdb -c /usr/bin/couchdb
22210 mysql      20   0  173M  139M  2692 S  0.0  1.7  0:18.25 /usr/bin/redis-server *:6379
16603 root       20   0  134M 18948  4196 S  0.0  0.2  5:33.98 /usr/bin/irssi
18570 nobody     20   0  437M 75224 16592 S  0.0  0.9  0:11.23 php-fpm: pool www
18718 nobody     20   0  429M 96024 44976 S  0.0  1.2  0:03.55 php-fpm: pool www

Load average: load average: 0.77 1.11 1.49

Seems OK too. Can I optimize something else?


(Jeff Atwood) #7

Make sure you have upped database ram in app.yml. Other than that and adjusting number of unicorn workers, that’s about all I can think of. Not sure message bus polling timeouts really matter though.


#8

Message bus is used to poll for updates (e.g. new posts, threads) and update the page?


(Jeff Atwood) #9

Yes but a “failure” there would never be visible to the user, and may not matter. Since the bus is always active, more messages will arrive later…


(Jeff Atwood) #10

Based on your feedback I added a line to clarify that Unicorns should scale with CPU cores, when possible

## TODO: How many concurrent web requests are supported?
## With 2GB we recommend 3-4 workers, with 1GB only 2
## If you have lots of memory, use one or two workers per logical CPU core
#UNICORN_WORKERS: 3

(Jay Pfaffman) #11

(This is an old topic, but it still seems like where this belongs)

So the “ram per worker” is db_work_mem? That seems right, but the comments there don’t seem to say what you said above.

  ## Set higher on large instances it defaults to 10MB, for a 3GB install 40MB is a good default
  ## this improves sorting performance, but adds memory usage per-connection
  db_work_mem: "512MB"

(Rafael dos Santos Silva) #12

work_mem is used several times per query.

So a complex query (like suggested topics) can use 5~10 times that, and this for each query running and you are fast into swapping territory.

The allow memory per work that Jeff cited, is more on don’t set 10 unicorn workers if you have a 10CPU server with 1GB ram.


(Jay Pfaffman) #13

You’re back!

OK, I get that I’m wrong. Is this more on target?

so with 8CPUs and 12GB
db_work_mem: 40MB
UNICORN_WORKERS: 8

and with 4 CPUs and 8GB
db_work_mem: 40MB
UNICORN_WORKERS: 4

And then I think also involved is the size of the database itself, which we want to be able to get buffered. With a huge database (say postgres_data is 8GB of disk space and there’s 8GB of ram), you might want to reduce the memory for the unicorn workers?


(Rafael dos Santos Silva) #14

Packing my stuff and took a minute to check Meta :blush:

Sounds good!

At this size, I would split machines:

  • DB Machine

  • Web Machine(s)

Then you can check timings (the awesome mini_profiler info should be enough) and scale machines independently.