Site goes down at the same time every day in memory-constrained environment

I don’t think the nginx inside the container has diverged from stock at all; I use the provided launcher script for all the Docker stuff.

This is what the host nginx config for the site looks like if you’re curious, but I don’t think this is the problem:

root@selectbutton:/etc/nginx/sites-enabled# cat discourse

server {
    listen 443 ssl;
    server_name selectbutton.net;
    ssl_certificate /etc/letsencrypt/live/selectbutton.net/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/selectbutton.net/privkey.pem;

    location /basic_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        allow 172.31.20.111;
        allow 52.35.138.13;
        deny all;
    }

    location / {
        root /var/www/discourse-root;
        try_files $uri @discourse;
    }

    location @discourse {
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass https://localhost:4443;
    }
}

server {
    listen 80;
    server_name selectbutton.net;
    return 301 https://$host$request_uri;
}

I’m on a server with 2 GB of RAM and I’m seeing the same behavior as of a few days ago. It seems to happen around backup time.

Well you would be using this:

And if I had to bet a :money_with_wings:, you do not have a template setting:

real_ip_header X-Forwarded-For;

The internal NGINX is going to have to trust that header.


Ah, OK, good to know. So should I add that to the host nginx or the container nginx (which is to say, via a to: block in app.yml)?

You need to run a replace command against the internal NGINX config.

After making the changes, look at the actual file on disk to confirm that you got what you wanted.

A trick I use is:

  1. ./launcher enter app
  2. cd /etc/nginx/conf.d/
  3. edit the discourse.conf file
  4. sv restart nginx
  5. see that my desired effect was achieved
  6. turn that change into a yml replace command (a sketch of one follows below)
  7. rebuild
  8. confirm file is good
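
For reference, that replace rule goes under the run: section of app.yml. A minimal sketch, assuming you want to drop the directive into /etc/nginx/conf.d/discourse.conf and that the file really contains a sendfile on; line to anchor on (the anchor is my assumption, so check your own file first):

    run:
      - replace:
          filename: "/etc/nginx/conf.d/discourse.conf"
          from: /sendfile on;/
          to: |
            sendfile on;
            real_ip_header X-Forwarded-For;

After the rebuild, step 8 above is just cat-ing that file inside the container and confirming the new line survived.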

OK, I’ll test whether it goes down again tomorrow despite increasing swap. Any other ideas if this turns out not to be it? The clockwork timing still makes me quite skeptical.

Yeah, the strict periodicity (in the absence of scheduled backups) is a very confusing aspect of this. Seems like a heck of a big clue; if you can figure out what’s running at that exact time, I think you’ll be 90% done.


Same problem for me :frowning:
I’m tired of rebuilding after every crash.

Check your /sidekiq and see if anything is running at that time.

What might also help is to start psql and take a look at what is going on in your database at that time.

SELECT pid, age(query_start, clock_timestamp()), usename, query FROM pg_stat_activity;
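
If it helps, a quick way to get to that prompt on a stock standalone install (a sketch; it assumes the database is the default one named discourse inside the app container):

    ./launcher enter app                # shell inside the running container
    su postgres -c 'psql discourse'     # open psql as the postgres user
    # paste the SELECT above; \watch 5 will re-run it every five seconds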


Sidekiq is idle :thinking:

And of course it seems not to have happened today, after occurring several days in a row at this exact time. Huh. Swap usage hasn’t ticked over 900M either. Oh well! Stay tuned, I guess.

The error 429 means you are hitting a rate limit, so no amount of additional resources is going to help you there…
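
For anyone following along: the rate-limiting template (if enabled) keys on the client address, so when every request shows up wearing the proxy’s IP, the whole site shares one bucket and trips the limit together. Illustrative nginx shape only (zone name and numbers are made up here, not the actual Discourse template):

    limit_req_zone $binary_remote_addr zone=flood:10m rate=12r/s;   # bucket keyed on client IP
    limit_req zone=flood burst=12 nodelay;                          # applied inside the server/location block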


It’s not always a 429 – it throws 408s too.

First you need to fix the issue where IP addresses are not reported correctly to the app, as stated upstream in this topic…

At the moment I’m failing to work out where to add it to the container nginx so that it actually has an effect; the container’s access log still reports the same IP for every hit each time I try adding it to a different block of the Discourse nginx config and restarting. The nginx docs suggest the realip module has to be built in separately, but I’m assuming it already is in your Docker distribution?

Also, I’m curious whether this has changed upstream semi-recently: as recently as Oct 10 (when I was probably not more than a month or so behind on updates), the Discourse site was able to identify external IPs other than localhost and report them properly, and our nginx configs haven’t changed in over a year.

OK, yeah, it’s not a build issue; nginx -V claims the realip module is in there. I can’t see anything weird in the past few months of commits to discourse-docker either.
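
(For anyone who wants to check the same thing, the module shows up in the configure flags; run inside the container:)

    nginx -V 2>&1 | grep -o with-http_realip_module   # prints the flag if realip was compiled in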

Yeah, still happening :frowning:

To everyone’s knowledge, should a dropped-in real_ip_header X-Forwarded-For; in the server {} block of the container’s nginx/conf.d/discourse.conf suffice for IP forwarding, and therefore might something else be going wrong if that isn’t working? And is it strange that the Discourse app has been able to detect requester IPs (e.g. for banning purposes) in the past with these same configs, if the container’s /var/log/nginx/access.log definitely can’t tell them apart now, no matter where I insert that config line?


OK, I ran this by a friend who has spent more time looking at nginx configs than I have, and they pointed out that I also needed set_real_ip_from 172.16.0.0/12; for the container nginx in addition to the real_ip_header X-Forwarded-For;. So at least that should stop the 429s. I’ll see if the problem still recurs.
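
For anyone who lands here later, the working pair inside the container’s /etc/nginx/conf.d/discourse.conf ended up looking like this (a sketch; the 172.16.0.0/12 range covers Docker’s default bridge addressing, so adjust it if your network differs):

    # inside the server { } block
    set_real_ip_from 172.16.0.0/12;     # trust the Docker bridge as the upstream proxy
    real_ip_header X-Forwarded-For;     # take the client address from this header

After a rebuild (or an sv restart nginx while testing), tailing the container’s /var/log/nginx/access.log should show distinct client addresses again.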


To swerve back to the original problem for a moment, do you happen to have any recollection as to when you last upgraded Docker? I’ve just tracked down a bug in Docker 17.11 which has the definite potential to cause the sorts of problems you reported. See the above post for more details and instructions on how to downgrade Docker to a working version.
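
(If anyone just needs the short version of the downgrade on a Debian/Ubuntu host running the docker-ce package, a sketch, with the target version left as a placeholder since it depends on which release you are rolling back to:)

    apt-cache madison docker-ce                   # list the versions your repo still offers
    apt-get install docker-ce=<older-version>     # pin back to the known-good release
    apt-mark hold docker-ce                       # keep apt from upgrading it again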


root@selectbutton:~# docker --version
Docker version 17.05.0-ce, build 89658be

But I appreciate you checking in! Looks like it may in fact have just been the header-forwarding issue in my case.
