429 too many connections issue with NGINX in front of NGINX


#1

Hi!
I have been getting more and more Error 429 (Too Many Connections) responses, and the whole forum hangs. It starts when more than 35 users are online.
The server is quite powerful: 4 cores at 2.3 GHz and 15 GB of RAM.
Can you please tell me what needs to be tuned or fixed for better performance? Is it in NGINX, or in the YAML?

Discourse is a default installation, using Docker.
Version - 2.0.0.beta1

Thanks!

Logs:

run-parts: executing /etc/runit/1.d/00-ensure-links
run-parts: executing /etc/runit/1.d/00-fix-var-logs
run-parts: executing /etc/runit/1.d/anacron
run-parts: executing /etc/runit/1.d/cleanup-pids
Cleaning stale PID files
run-parts: executing /etc/runit/1.d/copy-env
run-parts: executing /etc/runit/1.d/enable-brotli
run-parts: executing /etc/runit/1.d/letsencrypt
[Fri Jan 12 12:24:15 UTC 2018] Domains not changed.
[Fri Jan 12 12:24:15 UTC 2018] Skip, Next renewal time is: Sat Mar 10 00:30:21 UTC 2018
[Fri Jan 12 12:24:15 UTC 2018] Add '--force' to force to renew.
[Fri Jan 12 12:24:15 UTC 2018] Installing key to:/shared/ssl/motomirko.pl.key
[Fri Jan 12 12:24:15 UTC 2018] Installing full chain to:/shared/ssl/motomirko.pl.cer
[Fri Jan 12 12:24:15 UTC 2018] Run reload cmd: sv reload nginx
warning: nginx: unable to open supervise/ok: file does not exist
[Fri Jan 12 12:24:15 UTC 2018] Reload error for :
Started runsvdir, PID is 260
ok: run: redis: (pid 271) 0s
ok: run: postgres: (pid 276) 0s
rsyslogd: command 'KLogPermitNonKernelFacility' is currently not permitted - did you already set it via a RainerScript command (v6+ config)? [v8.16.0 try http://www.rsyslog.com/e/2222 ]
rsyslogd: imklog: cannot open kernel log (/proc/kmsg): Operation not permitted.
rsyslogd: activation of module imklog failed [v8.16.0 try http://www.rsyslog.com/e/2145 ]
rsyslogd: Could not open output pipe '/dev/xconsole':: No such file or directory [v8.16.0 try http://www.rsyslog.com/e/2039 ]
supervisor pid: 268 unicorn pid: 293
Shutting Down
run-parts: executing /etc/runit/3.d/01-nginx
ok: down: nginx: 0s, normally up
run-parts: executing /etc/runit/3.d/02-unicorn
exiting
ok: down: unicorn: 1s, normally up
run-parts: executing /etc/runit/3.d/10-redis
ok: down: redis: 0s, normally up
run-parts: executing /etc/runit/3.d/99-postgres
ok: down: postgres: 0s, normally up
When using programs that use GNU Parallel to process data for publication please cite:

  O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
  ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; and it won't cost you a cent.
Or you can get GNU Parallel without this requirement by paying 10000 EUR.

To silence this citation notice run 'parallel --bibtex' once or use '--no-notice'.

ok: down: nginx: 3s, normally up
ok: down: postgres: 0s, normally up
ok: down: redis: 2s, normally up
ok: down: unicorn: 3s, normally up
ok: down: cron: 0s, normally up
ok: down: rsyslog: 1s, normally up
run-parts: executing /etc/runit/1.d/00-ensure-links
run-parts: executing /etc/runit/1.d/00-fix-var-logs
run-parts: executing /etc/runit/1.d/anacron
run-parts: executing /etc/runit/1.d/cleanup-pids
Cleaning stale PID files
run-parts: executing /etc/runit/1.d/copy-env
run-parts: executing /etc/runit/1.d/enable-brotli
run-parts: executing /etc/runit/1.d/letsencrypt
[Fri Jan 12 12:51:45 UTC 2018] Domains not changed.
[Fri Jan 12 12:51:46 UTC 2018] Skip, Next renewal time is: Sat Mar 10 00:30:21 UTC 2018
[Fri Jan 12 12:51:46 UTC 2018] Add '--force' to force to renew.
[Fri Jan 12 12:51:46 UTC 2018] Installing key to:/shared/ssl/motomirko.pl.key
[Fri Jan 12 12:51:46 UTC 2018] Installing full chain to:/shared/ssl/motomirko.pl.cer
[Fri Jan 12 12:51:46 UTC 2018] Run reload cmd: sv reload nginx
fail: nginx: runsv not running
[Fri Jan 12 12:51:46 UTC 2018] Reload error for :
Started runsvdir, PID is 255
rsyslogd: command 'KLogPermitNonKernelFacility' is currently not permitted - did you already set it via a RainerScript command (v6+ config)? [v8.16.0 try http://www.rsyslog.com/e/2222 ]
rsyslogd: imklog: cannot open kernel log (/proc/kmsg): Operation not permitted.
rsyslogd: activation of module imklog failed [v8.16.0 try http://www.rsyslog.com/e/2145 ]
rsyslogd: Could not open output pipe '/dev/xconsole':: No such file or directory [v8.16.0 try http://www.rsyslog.com/e/2039 ]
ok: run: redis: (pid 266) 0s
ok: run: postgres: (pid 272) 0s
supervisor pid: 269 unicorn pid: 294
Reopening logs
Reopening logs
Shutting Down
run-parts: executing /etc/runit/3.d/01-nginx
ok: down: nginx: 0s, normally up
run-parts: executing /etc/runit/3.d/02-unicorn
exiting
ok: down: unicorn: 1s, normally up
run-parts: executing /etc/runit/3.d/10-redis
ok: down: redis: 0s, normally up
run-parts: executing /etc/runit/3.d/99-postgres
ok: down: postgres: 0s, normally up
When using programs that use GNU Parallel to process data for publication please cite:

  O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
  ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; and it won't cost you a cent.
Or you can get GNU Parallel without this requirement by paying 10000 EUR.

To silence this citation notice run 'parallel --bibtex' once or use '--no-notice'.

ok: down: nginx: 4s, normally up
ok: down: postgres: 0s, normally up
ok: down: redis: 3s, normally up
ok: down: unicorn: 4s, normally up
ok: down: cron: 0s, normally up
ok: down: rsyslog: 0s, normally up
run-parts: executing /etc/runit/1.d/00-ensure-links
run-parts: executing /etc/runit/1.d/00-fix-var-logs
run-parts: executing /etc/runit/1.d/anacron
run-parts: executing /etc/runit/1.d/cleanup-pids
Cleaning stale PID files
run-parts: executing /etc/runit/1.d/copy-env
run-parts: executing /etc/runit/1.d/enable-brotli
run-parts: executing /etc/runit/1.d/letsencrypt
[Sun Jan 14 20:59:12 UTC 2018] Domains not changed.
[Sun Jan 14 20:59:12 UTC 2018] Skip, Next renewal time is: Sat Mar 10 00:30:21 UTC 2018
[Sun Jan 14 20:59:12 UTC 2018] Add '--force' to force to renew.
[Sun Jan 14 20:59:12 UTC 2018] Installing key to:/shared/ssl/motomirko.pl.key
[Sun Jan 14 20:59:12 UTC 2018] Installing full chain to:/shared/ssl/motomirko.pl.cer
[Sun Jan 14 20:59:12 UTC 2018] Run reload cmd: sv reload nginx
fail: nginx: runsv not running
[Sun Jan 14 20:59:12 UTC 2018] Reload error for :
Started runsvdir, PID is 254
ok: run: redis: (pid 264) 0s
ok: run: postgres: (pid 266) 0s
rsyslogd: command 'KLogPermitNonKernelFacility' is currently not permitted - did you already set it via a RainerScript command (v6+ config)? [v8.16.0 try http://www.rsyslog.com/e/2222 ]
rsyslogd: imklog: cannot open kernel log (/proc/kmsg): Operation not permitted.
rsyslogd: activation of module imklog failed [v8.16.0 try http://www.rsyslog.com/e/2145 ]
rsyslogd: Could not open output pipe '/dev/xconsole':: No such file or directory [v8.16.0 try http://www.rsyslog.com/e/2039 ]
supervisor pid: 269 unicorn pid: 293
Reopening logs
Reopening logs

(Bhanu Sharma) #2

Can you provide some details about your network?
Are there any firewalls or security solutions in front of Docker that may be interfering?


#3

250 Mbps of public bandwidth, a standard firewall for ports, and no limits. I checked from the server side and everything looks OK; network load was really low. Fewer than 30 users were logged in. It looks like an internal application issue: the errors were served by the Discourse engine.


(Bhanu Sharma) #4

Maybe you need to disable rate limiting.
Take some cues from here, if it helps.

EDIT:
This particular post deals with increasing the limits.


#5

I just received
[image]
when I was trying to answer the post…


(cpradio) #6

Are you using an NGINX/Apache proxy in front of Discourse? If so, are you properly forwarding the client’s IP address to Discourse?


#7

Let me try that.

Yes. The only redirection is in app.yml:

  - "8443:443" # https

(cpradio) #8

If you are using a proxy and getting 429 errors after only a small number of people join, you are probably not forwarding the client’s IP properly to Discourse. It then sees everyone as coming from the same server IP, which is why you are hitting the rate limits.
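
To illustrate why a shared address trips the limits, here is a minimal sketch of per-IP limiting in NGINX. The zone names (`perip_req`, `perip_conn`), the rates, and the backend address are illustrative assumptions, not the values shipped in the Discourse templates. Both zones are keyed on `$binary_remote_addr`, so if every request appears to come from the proxy, all users draw from a single budget.

```nginx
# Sketch only: belongs in the http context of the NGINX doing the limiting.
# Zone names, sizes and rates are made-up values for illustration.
limit_req_zone  $binary_remote_addr zone=perip_req:10m rate=12r/s;
limit_conn_zone $binary_remote_addr zone=perip_conn:10m;

# Answer with 429 instead of the default 503 when a limit is exceeded.
limit_req_status  429;
limit_conn_status 429;

server {
    listen 80;

    location / {
        # Each distinct client address gets its own request and connection
        # budget. Behind a proxy that hides the client address,
        # $binary_remote_addr is the proxy's address, so every visitor
        # shares the same budget.
        limit_req  zone=perip_req burst=12 nodelay;
        limit_conn perip_conn 20;

        proxy_pass http://127.0.0.1:3000;  # hypothetical backend address
    }
}
```

With a few dozen users behind one apparent address, a connection cap of this size can plausibly be reached with only a handful of open connections per browser.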

Have you read


#9

I also got a 429 today with v2.0.0.beta1 +9, with NGINX configured in front and no change in configuration. I am using:

templates:
  - "templates/web.template.yml"
  - "templates/web.ratelimited.template.yml"
  - "templates/web.socketed.template.yml"

I never had this before, and the instance is not under heavy use at the moment. I rather smell a bug in rate limiting.


(Matt Palmer) #10

I smell a bug in your forwarding nginx config.


#11

Ok, so this is my sitename config:

upstream motomirko-prod {
  server 127.0.0.1:8443;
}

server {
  server_name motomirko.pl;
  listen 443 ssl http2;

  include conf.d/ssl;
  ssl_certificate           /var/discourse/shared/standalone/ssl/motomirko.pl.cer;
  ssl_certificate_key       /var/discourse/shared/standalone/ssl/motomirko.pl.key;

  proxy_set_header Host $host;
  proxy_set_header X-Real-IP $remote_addr;
  proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

  # add HSTS for security reasons
  add_header Strict-Transport-Security "max-age=31536000" always;

  location ~ ^/chat/topic/[0-9]+/[0-9]+ {
    #error_log /var/log/nginx/rewrite.log notice;
    #rewrite_log on;
    rewrite "^/chat/topic/([0-9]+)/([0-9]+)$" "/chat/offtop/$1/$2" redirect;
  }


  location / {
    proxy_pass https://motomirko-prod;
  }

  error_page 500 502 503 504 /error.html;
  location = /error.html {
    root /var/www/error/;
    internal;
  }
}

Can you please look at it and tell me what I should fix?


#12

Well the configuration didn’t change in a while, except for the http2 timeout option I added after hitting the 429… Here you go… It uses 5 different files:

  1. /etc/nginx/conf.d/discourse.conf:
#
## discourse upstream
#

upstream discourse {
        server unix:/var/discourse/shared/web/nginx.http.sock;
}
  2. /etc/nginx/le.conf:

# LE configuration for 80 and 443

location /.well-known/acme-challenge {
        alias /srv/www/.well-known/acme-challenge;
}

# Add some more security headers

add_header X-Content-Type-Options nosniff;
#add_header X-Frame-Options SAMEORIGIN;
#add_header X-XSS-Protection "1; mode=block";
  3. /etc/nginx/le-ssl.conf:
# SSL Configuration
#
# In /etc/nginx/sites-available/ssl.example.org:
#
# Replace 'ssl.example.org' with your secure domain
# Add the resulting lines to your server configuration:
#
# include              le-ssl.conf
# ssl_certificate      /etc/letsencrypt/live/ssl.example.org/fullchain.pem;
# ssl_certificate_key  /etc/letsencrypt/live/ssl.example.org/privkey.pem;

ssl on;

ssl_ciphers 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:CAMELLIA:!DES-CBC3-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA';

ssl_prefer_server_ciphers on;

ssl_dhparam /etc/ssl/dhparams.pem;

ssl_protocols TLSv1 TLSv1.1 TLSv1.2;

ssl_session_cache shared:SSL:10m;

ssl_stapling on;
ssl_stapling_verify on;

add_header Strict-Transport-Security 'max-age=63072000';
  4. /etc/nginx/proxy_params:
proxy_set_header Host $http_host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
  5. /etc/nginx/sites-enabled/ps.zoethical.com:
## ps.zoethical.com
#

server {
        listen       80;
        listen       [::]:80;
        server_name  ps.zoethical.com;

        include le.conf;

        return 301 https://$server_name$request_uri;
}

server {
        listen       443 ssl http2;
        listen       [::]:443 ssl http2;
        server_name  ps.zoethical.com;

        include      le.conf;
        include      le-ssl.conf;

        ssl_certificate      /etc/letsencrypt/live/zoethical.com/fullchain.pem;
        ssl_certificate_key  /etc/letsencrypt/live/zoethical.com/privkey.pem;

        root         /srv/www/zoethical.com/ps;
        index        index.html;

        client_max_body_size 0;
        http2_idle_timeout   5m;

        location /errorpages/ {
                alias /srv/www/zoethical.com/errorpages/;
        }

        location / {
                proxy_pass         http://discourse;
                proxy_http_version 1.1;
                proxy_redirect     off;
                proxy_set_header   Upgrade $http_upgrade;
                proxy_set_header   Connection "upgrade";
                include            proxy_params;

                error_page 502 =502 /errorpages/discourse_offline.html;
        }
}

(Kane York) #13

… what?

Not that I think that’s causing this, but isn’t that asking to treat everything like a websocket connection?


#14

Hmm, indeed there’s no reason to do that at all, since Discourse does not use websockets. I can’t remember why I put this here. It might be from Nginx + discourse, in which case it’s indeed unnecessary. It’s always good to have other people looking at your configuration files!
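
For reference, a sketch of what the `location /` block from the site config above could look like once the two websocket headers are dropped. It reuses only names from that config (`discourse` upstream, `proxy_params`, the error page path) and has not been tested against a live instance.

```nginx
location / {
        proxy_pass         http://discourse;
        proxy_http_version 1.1;
        proxy_redirect     off;
        # No "Upgrade" / Connection "upgrade" headers here: they are only
        # needed when proxying websocket connections.
        include            proxy_params;

        error_page 502 =502 /errorpages/discourse_offline.html;
}
```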

But then, is there nothing like a configuration error that would lead to the 429?


(Matt Palmer) #15

Not in the bits of the config you’ve shared so far, no. But where’s the rest?


(Sam Saffron) #16

You need to use the set_real_ip_from directive (together with real_ip_header) in the internal NGINX, so that it takes the client IP from the header the outer NGINX is forwarding.
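
A minimal sketch of what that could look like on the internal NGINX, using directives from the `ngx_http_realip_module` (which has to be compiled in). The address `172.17.0.1` is an assumption: it is a common Docker bridge gateway address, and stands in for whatever address the outer proxy connects from in your setup. How these lines get into the container's NGINX config depends on your installation and is not shown here.

```nginx
# Internal NGINX, http or server context.

# Trust forwarded addresses only when the request comes from the outer proxy.
# 172.17.0.1 is an assumed address; replace it with the one your proxy uses.
set_real_ip_from 172.17.0.1;

# If the outer proxy connects over a UNIX socket instead of TCP,
# trust socket connections with this form:
# set_real_ip_from unix:;

# Read the client address from the header the outer proxy sets.
real_ip_header X-Forwarded-For;

# With several trusted proxies in the chain, use the last address in the
# header that is not one of the trusted ones.
real_ip_recursive on;
```

Once the client address is restored this way, `$remote_addr` and `$binary_remote_addr` refer to the visitor rather than the proxy, so per-IP limits apply per visitor again.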

To be honest I would just recommend dropping the rate limiting header and doing the rate limiting in the app using:


(Jeff Atwood) #17

When you put NGINX in front of NGINX you are opting into pain.

Fix your configuration, or stop doing double-NGINX to reduce your configuration’s complexity.


#18

The biggest problem is with Discourse updates. Without a second NGINX serving a “Maintenance time, we’ll be back soon” splash screen, there’s just an ugly 404.


#19

I’m not sure what ‘rest’ you’re referring to, Matt. I removed the offending lines and upgraded the security headers a bit. I can show the updated configuration if you like.

Logs in the Web container clearly show that the remote IP is taken into account. I reviewed my logs more thoroughly and realized that since last November I have (very few – 40) instances of rate limiting issues mostly coming from /mini-profiler-resources/results (38).

@sam I wouldn’t recommend dropping the rate limit entirely, unless you’re already behind a proxy doing it for you.


(Sam Saffron) #20

Hmmm, so disable mini profiler?

I’m not advocating removing all rate limits, just handling application rate limiting in the app if your config is too complex.