Site goes down at the same time every day in memory-constrained environment

Hi,

I’m not sure what this could be, and I think running Discourse on 1GB of memory plus 1GB of swap is probably below minimum spec at this point, so I’d understand if this is outright unsupported. But I’m having a weird issue where my site goes down at exactly the same time every day (around 2:48 PST) for five minutes or so, and trying to hit it in the meantime sometimes throws 429s.

SSHing into the server and eyeballing top when this happens shows that about half of the swap has been released, which suggests something else might be running on the server and elbowing out Discourse, but I don’t know what it could be; there’s nothing in cron for that time of day. Most of the CPU is consumed by a single postmaster process, which runs for a little over two minutes and is then killed, after which everything gradually goes back to normal.

Any idea what’s up here? Been happening for the past several releases at least.

2 Likes

I don’t think it could be a crawler, given that I don’t know of any crawlers that are that punctual…

Per previous discussions, it is almost certainly the database backup blowing out all your memory. Increase swap to 2GB.
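
For reference, growing the swapfile to 2GB usually looks something like this. Sketch only: it assumes swap lives in a file at /swapfile (which is what the standard install creates), so adjust the path if yours is set up differently.

swapoff /swapfile
fallocate -l 2G /swapfile    # or: dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
free -m                      # confirm ~2GB of swap is now available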

1 Like

Will do, thanks! Probably want to update your install docs to reflect that if you haven’t already.

This is a relatively new problem, cc @sam

1 Like

For the record, I just checked and I have automatic backups disabled, so unless it’s something else that runs daily (something redis does, say) and isn’t exposed in the Discourse admin config, that shouldn’t be it. But I’ve increased swap anyway and we’ll see how tomorrow goes!

Hi Alex,

429 is a very odd response code for the site to be sending due to memory problems. That’s normally because someone on the same IP address is doing something untoward. Is there any chance someone else on the same connection (a person or machine in your office, for example) is running some sort of scraping batch job at the same time?

1 Like

Those might be a red herring: I’m running the container behind an nginx on the host so I can keep some old LAMP apps up on a couple of other subdomains, and I think the 429s are just the result of people repeatedly trying to reach the site while it’s down like this.

And I know I probably shouldn’t be doing that at or below minimum spec, so you’re free to disavow, but none of that has changed in a dog’s age, and this only started happening with the most recent Discourse versions, always at the same time of day.

Well, one thing I would check is that the NGINX inside the container (which has the rate limiting template enabled) does not think every single user has the same IP address.

1 Like

I don’t think the nginx inside the container has diverged from stock at all; I use the provided launcher script for all the docker stuff.

This is what the host nginx config for the site looks like if you’re curious, but I don’t think this is the problem:

root@selectbutton:/etc/nginx/sites-enabled# cat discourse

server {
    listen 443 ssl;
    server_name selectbutton.net;
    ssl_certificate /etc/letsencrypt/live/selectbutton.net/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/selectbutton.net/privkey.pem;

    location /basic_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        allow 172.31.20.111;
        allow 52.35.138.13;
        deny all;
    }

    location / {
        root /var/www/discourse-root;
        try_files $uri @discourse;
    }

    location @discourse {
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass https://localhost:4443;
    }
}

server {
    listen 80;
    server_name selectbutton.net;
    return 301 https://$host$request_uri;
}

I’m on a server with 2 GB of RAM and I’m seeing the same behavior as of a few days ago. It seems to happen around backup time.

Well you would be using this:

https://github.com/discourse/discourse_docker/blob/master/templates/web.ratelimited.template.yml

And if I had to bet a :money_with_wings:, you do not have this set in the template:

real_ip_header X-Forwarded-For;

The internal NGINX is going to have to trust that header.

2 Likes

Ah, OK, good to know. So should I add that to the host nginx or to the container nginx (which is to say, the to: block in the app.yml)?

You need to run a replace command against the internal NGINX config.

After making the changes, look at the actual file on disk to confirm that you got what you wanted.

A trick I use is:

  1. ./launcher enter app
  2. cd /etc/nginx/conf.d/
  3. edit the discourse.conf file
  4. sv restart nginx
  5. see that my desired effect was achieved
  6. turn that change into a yml replace command (sketch below)
  7. rebuild
  8. confirm file is good
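
For reference, a replace rule for this might end up looking something like the following in app.yml’s run: section. This is only a sketch: the anchor pattern and the set_real_ip_from address are assumptions, so check what your container’s /etc/nginx/conf.d/discourse.conf actually contains and which address the host nginx connects from before settling on it.

run:
  - replace:
      filename: "/etc/nginx/conf.d/discourse.conf"
      from: /listen 80;/
      to: |
        listen 80;
        # trust X-Forwarded-For from the host nginx; the address below is an
        # assumption, use whatever IP the host proxy actually connects from
        real_ip_header X-Forwarded-For;
        set_real_ip_from 172.17.0.1;

After the rebuild, step 8 above is the check that the directives actually landed in the file.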
2 Likes

OK, will test if it goes down again tomorrow despite increasing swap. Any other ideas if this turns out not to be it? The precise timing still makes me quite skeptical.

Yeah, the strict periodicity (in the absence of scheduled backups) is a very confusing aspect of this. It seems like a heck of a big clue; if you can figure out what’s running at that exact time, I think you’ll be 90% done.

3 Likes

Same problem for me :frowning:
I’m tired of rebuilding after every crash.

Check your /sidekiq and see if anything is running at that time.

It might also help to start psql and take a look at what is going on in your database at that time:

SELECT pid, age(query_start, clock_timestamp()), usename, query FROM pg_stat_activity;

2 Likes

Sidekiq is idle :thinking:

And of course it seems not to have happened today, after occurring several days in a row at this exact time. Huh. Swap usage hasn’t ticked over 900M either. Oh well! Stay tuned, I guess.