White screens under higher load, but server is not stressed


(Bart) #1

Hi,

earlier this week I ran an ‘ask me anything’ in my community for a few hours. During this time I had between 20 and 40 people on the site (measured in Google Analytics real-time). Under normal conditions my Discourse installation is rock solid, but now many people suffered white screens and connection errors, and were unable to post their messages.

During the session I monitored CPU and memory usage and these were all fine: usually less than 30% CPU load, free memory available, and no swap in use. I’m running on a 2GB DigitalOcean instance with a 1GB swapfile.

I couldn’t find any errors in the Rails logs, and I’m quite at a loss as to what might have caused this. Any ideas?


(Jens Maier) #2

Did you check nginx’s logs?

I have seen a similar problem on an unrelated system, where the admin was using /dev/random for the webserver’s TLS random source. Occasionally, when the webserver tried to re-seed its internal pseudo-random number generator, there wasn’t enough entropy available and the webserver would drop SSL requests until /dev/random had finally produced enough random bits…

Infuriatingly, the cause of these connection drops was never logged… not even a TLS handshake failure was mentioned in the logs. :angry:
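
If you want to rule that out, a quick check on Linux is the kernel’s entropy estimate; values that regularly fall close to zero under load mean /dev/random may block (a rough indicator only):

cat /proc/sys/kernel/random/entropy_avail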


(Bart) #3

Ah! You got me thinking: I did check my nginx log, but not the error log of the nginx inside Docker. Here’s what I saw during the time of the Ask Me Anything:

2015/02/27 17:06:45 [error] 50#0: *50491 limiting requests, excess: 100.022 by zone "bot", client: 172.17.42.1, server: _, request: "POST /message-bus/ae853c3a40784a75a97acb3bb43fb359/poll?dlp=t HTTP/1.0", host: "community.blendernation.com", referrer: "http://community.blendernation.com/t/gooseberry-rigging-ama-friday-feb-27th-4pm-6pm-cest/506/17"

2015/02/27 17:08:10 [error] 48#0: *51268 limiting requests, excess: 12.728 by zone "flood", client: 172.17.42.1, server: _, request: "POST /message-bus/ec0b94455cc94751a77814b2125792da/poll? HTTP/1.0", host: "community.blendernation.com", referrer: "http://community.blendernation.com/t/gooseberry-rigging-ama-friday-feb-27th-4pm-6pm-cest/506/57"

Lots of lines about rate limiting. The thing is: they all list the Docker bridge IP address as the one being rate limited. That doesn’t seem to make much sense, because it doesn’t protect the system against bots or floods from a single IP. Shouldn’t it rate limit based on the actual client’s IP?

Also, what’s the best practice here to increase these limits? Obviously my system is able to handle more traffic.


(Sam Saffron) #4

Odd, can you tell me a bit more about your setup? Is it standard?


(Bart) #5

Hey Sam,

It’s the standard Docker install on a 2GB DigitalOcean instance. The only non-standard thing is that my upstream nginx also serves my main domain, which is a WordPress installation. The main domain is cached by CloudFlare; the Discourse domain is not configured to use CloudFlare.

Is that useful? What other kind of info would you like to have?


(Sam Saffron) #6

This is a huge difference. I recommend you either figure out a rule for NGINX that whitelists CloudFlare or extracts the client IPs from the header, or remove the rate limiting template and rebuild the container.
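
The whitelist variant would look roughly like this in nginx, using the usual geo/map trick where an empty key is not rate limited; the range below is only an example, use CloudFlare’s published list:

geo $rl_exempt {
  default          0;
  103.21.244.0/22  1;    # example CloudFlare range; add the full published list
}
map $rl_exempt $rl_key {
  1 "";                   # empty key: request is not counted by limit_req
  0 $binary_remote_addr;  # everyone else is limited per client IP
}
limit_req_zone $rl_key zone=flood:10m rate=12r/s;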


(Bart) #7

Sorry, I wasn’t clear; this is my setup:

That shouldn’t affect Discourse, as far as I understand?


(Sam Saffron) #8

It clearly is having some sort of effect; disable that template and you should be OK.


(Bart) #9

Ok! Just to be clear: what does ‘disabling a template’ mean exactly? Delete it? Will it return when I update Discourse, or is there a more permanent way to achieve this?

Thanks!


(Sam Saffron) #10

It means removing the line that includes templates/web.ratelimited.template.yml from your copy of the container definition, which is usually called app.yml,

and doing:

cd /var/discourse
git pull
./launcher rebuild app
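
For reference, the templates section of app.yml looks roughly like this (your entries may differ); the rate limiting line is the one to take out or comment out:

templates:
  - "templates/postgres.template.yml"
  - "templates/redis.template.yml"
  - "templates/web.template.yml"
  #- "templates/web.ratelimited.template.yml"   # commented out to disable nginx rate limiting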

(Jeff Atwood) #11

I think CloudFlare is incorrectly causing all your hits to be reported as from the same IP as well.

edit: never mind, I saw you are not passing Discourse through CF. So that’s good :slight_smile:

I’d also suggest editing the values in the rate limiting template rather than disabling it, otherwise you become vulnerable to DDoS and overload etc.
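
If you go that route, the values live in the params block at the top of templates/web.ratelimited.template.yml, roughly like this; check your copy for the exact parameter names (it also has burst settings):

params:
  reqs_per_second: 24   # nginx "flood" zone, per client IP; the default is 12
  reqs_per_minute: 400  # nginx "bot" zone; the default is 200

After editing, rebuild the container so the new limits end up in the generated nginx config.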


(Jake Shadle) #12

We just started having problems with our Discourse instance in the past couple of days, and it took us a little while to track them down to the nginx rate limits in templates/web.ratelimited.template.yml, once we noticed that the browser was getting a ton of 429 responses from the server.
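
For anyone else chasing this, the tell-tale lines end up in the container’s nginx error log; on a standard standalone install something like this surfaces them (the path is the usual default, adjust if yours differs):

grep "limiting requests" /var/discourse/shared/standalone/log/var-log/nginx/error.log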

Couple of notes that would have made this less problematic:

  • Calling out, in the Advanced Install Guide, the rate limiting that nginx applies before requests even reach Discourse
  • Adding a note in Settings => Rate Limits to indicate that you may need to change or remove templates/web.ratelimited.template.yml, because of the difference between the HTTP-level limiting and Discourse’s application-level rate limiting
  • The default of 12 requests/sec seems incredibly low; we were able to hit it with just 40 users.

Anyway, that little hurdle is behind us, thanks for the great work!


(Jeff Atwood) #13

Probably the difference here is that these users are all coming from the same IP, since your Discourse is internal?

Also if you need any Star Wars Battlefront testers, have I mentioned that many members of the Discourse team have high end video cards and we are excellent testers? :wink:


(Jake Shadle) #14

Yes, the instance and the users are all on the internal network, and we have another nginx server (with no rate limiting at all) on the same machine that forwards the Discourse requests to Discourse’s nginx, so I’m not sure how all that comes together to get overloaded with so few users. But like I said, it isn’t a problem anymore after bumping the default limits and rebuilding the app; it was just kind of confusing for us until we found this topic. :smile:

Luckily there are some guys on Frostbite who have pull with the people who pull the strings, one of whom was one of the big proponents for setting up Discourse for us in the first place, so we might be able to work something out whenever Battlefront goes into Alpha/Beta and sends out play test invites to wider groups! That won’t be for a few months yet, though. :wink:


(Kane York) #15

By the way, I think you can also set it up so that the inner nginx trusts the IP of the outer nginx in the X-Forwarded-For header, chaining up to the front proxy.
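
Roughly, that means adding something like the following to the inner nginx, though with the Discourse container you’d want it to go in via a template so it survives rebuilds; 172.17.42.1 is just the outer nginx as the container sees it in the logs above, adjust to your setup:

set_real_ip_from 172.17.42.1;     # trust the outer nginx
real_ip_header X-Forwarded-For;   # take the client address from the forwarded header
real_ip_recursive on;             # step back over any additional trusted proxies in the chain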


(Paolo G. Giarrusso) #16

How? That sounds great, and it’s exactly what I need!


(Paolo G. Giarrusso) #17

Discourse rate-limiting behind a reverse proxy: my best guess (tested once)

I think I’ve figured it out; I deployed this and it seems to work, but I’ve learned half of this in the last 10 minutes and tested it on one instance, so beware. It also seems that moving the rate limiting into the reverse proxy could be more efficient, but that’s not a change I’d want to make live, and I’m not sure it matters.

The change to the template (/var/discourse/templates/web.ratelimited.template.yml) would be:

-limit_req_zone $binary_remote_addr zone=flood:10m rate=$reqs_per_secondr/s;
-limit_req_zone $binary_remote_addr zone=bot:10m rate=$reqs_per_minuter/m;
+limit_req_zone $http_x_forwarded_for zone=flood:100m rate=$reqs_per_secondr/s;
+limit_req_zone $http_x_forwarded_for zone=bot:100m rate=$reqs_per_minuter/m;

With this change, the rate limiting uses X-Forwarded-For as the key in the rate limiting hash table, rather than the IP address of the direct HTTP client in binary form. References: the rate limiting docs, the http_* variable docs, your reverse proxy configuration, and the docs for $proxy_add_x_forwarded_for, since that is what’s used to set X-Forwarded-For.
Note that I increased the size of the hash table out of caution (probably by more than needed), because the keys are bigger (a list of IP addresses in text form rather than a single binary address) and the docs mention that key size matters (Module ngx_http_limit_req_module). Doing the change on the reverse proxy instead would avoid that.
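
For comparison, the reverse-proxy variant would be roughly the following on the outer nginx (untested on my side; the upstream address is just an example). Since the outer nginx sees the real client address, $binary_remote_addr works as the key there:

# in the http block
limit_req_zone $binary_remote_addr zone=discourse_flood:10m rate=12r/s;

# in the server block that proxies to the Discourse container
location / {
  limit_req zone=discourse_flood burst=12 nodelay;
  proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
  proxy_pass http://127.0.0.1:8080;   # wherever the container listens
}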

Rebuilding the container after a usage burst has already started is not ideal, so I also applied the change live inside the running container. I don’t recommend this unless you know what you’re doing, and I’m not sure I knew it myself. In any case, also modify the template if you don’t want the change to be destroyed the next time you rebuild. (In particular, I needed this after telling ~400 people in person, all students of a first-year CS course, to register.)

# docker exec -it info1-discourse env TERM=xterm /bin/bash
# vi /etc/nginx/conf.d/discourse.conf

change

limit_req_zone $binary_remote_addr zone=flood:10m rate=12r/s;
limit_req_zone $binary_remote_addr zone=bot:10m rate=200r/m;

to

limit_req_zone $http_x_forwarded_for zone=flood:100m rate=12r/s;
limit_req_zone $http_x_forwarded_for zone=bot:100m rate=200r/m;

reload nginx with

nginx -s reload

and retest your website. EDIT: fixed typo in code, sorry!
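
By the way, a cheap safety net is to validate the edited config inside the container before reloading:

nginx -t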


(Christopher Heald) #18

Can you confirm, now that you have had more than 10 minutes of testing, that your rate-limited whitelisting method is a long-term workable solution?


(Paolo G. Giarrusso) #19

If you’re talking about my setup (though I’m not doing any whitelisting), I’ve had no more problems with it. I guess I oversized the rate limiting hash table, but that doesn’t seem to be a problem: the memory usage (RSS) of nginx’s master process is under 6MB. I could provide more info on the scale, but I’m not sure which stats to quote.
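
For the record, one way to check that figure is to look at the resident set size of the nginx processes inside the container:

ps -C nginx -o pid,rss,cmd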


(Danny Goodall) #20

Sorry to bump a very old thread, but just in case any intrepid Googlers land here - as I did…

I’d go so far as to suggest that you have to edit the values in the rate limiting template.

At least that is what my experience suggests. I commented out the template in app.yml, rebuilt the app, and then got a 502 Bad Gateway.

Editing the rate-limiting variables in the template instead increased the limits successfully.

