I’m not sure what this could be, and I realize that running Discourse on 1GB of memory plus 1GB of swap is probably below minimum spec at this point, so I’d understand if this is outright unsupported. But I’m having a weird issue where my site goes down at exactly the same time every day (around 2:48 PST) for five minutes or so, and trying to hit it in the meantime sometimes throws 429s.
SSHing into the server and eyeballing top when this happens shows that about half of swap has been released (suggesting that maybe something else is running on the server and elbowing out Discourse, though I don’t know what it could be; there’s nothing in cron for that time of day). Meanwhile a single postmaster process consumes most of the CPU for a little over two minutes, gets killed, and then everything gradually goes back to normal.
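Next time it hits I’ll try to catch what that postmaster backend is actually chewing on while it’s pegged. Something like this, assuming the standard /var/discourse docker setup (the paths and database name may differ on other installs):

```bash
# From the host, assuming the standard /var/discourse install (adjust the path if yours differs)
cd /var/discourse
./launcher enter app

# Inside the container: ask Postgres which non-idle queries are running and for how long
su postgres -c "psql discourse -c \"
  SELECT pid, state, now() - query_start AS runtime, left(query, 120) AS query
  FROM pg_stat_activity
  WHERE state <> 'idle'
  ORDER BY runtime DESC;\""
```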
Any idea what’s up here? Been happening for the past several releases at least.
For the record, I just checked and I do have automatic backups disabled, unless it’s something else that, e.g., Redis does daily and isn’t exposed in the Discourse admin config. But I’ve increased swap anyway, and we’ll see how tomorrow goes!
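I’ll also sweep the host for anything else scheduled around that time of day. Roughly this, though the exact commands vary a bit by distro:

```bash
# Anything scheduled on the host besides user crontabs?
sudo ls /etc/cron.d /etc/cron.daily /etc/cron.hourly
sudo crontab -l          # root's crontab
crontab -l               # my user's crontab

# systemd timers -- a lot of distro maintenance (logrotate, apt, etc.) runs this way now
systemctl list-timers --all
```

Discourse’s own recurring jobs should also be visible to admins under /sidekiq on the site itself (there’s a scheduler view there), if I’m remembering right.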
429 is a very odd response code for the site to be sending due to memory problems. That’s normally because someone on the same IP address is doing something untoward. Is there any chance someone else on the same connection (a person or machine in your office, for example) is running some sort of scraping batch job at the same time?
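If you want to check that, grepping the access log for the problem window and counting requests per client IP should show whether one address dominates. A rough sketch, assuming the default log path and combined format (and remember the timestamps are in the server’s time zone):

```bash
# Requests per client IP during the problem window; the ':02:4' pattern matches
# 02:40-02:49 in the log's timestamps (adjust the path and pattern to your setup)
grep ':02:4' /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head
```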
Those might be a red herring – I’m running the container behind an nginx on the host so I can keep some old LAMP apps up on a couple of other subdomains, and I think the 429s are just the result of people hammering the site while it’s down like this.
And I know I probably shouldn’t be doing that at or below minimum spec, so you’re free to disavow, but those haven’t changed in a dog’s age and this has only started happening with the most recent Discourse versions, always at the same time of day.
Well, one thing I would check is that the NGINX inside the container (which has the rate-limiting template enabled) doesn’t think every single user has the same IP address.
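Concretely, the host nginx needs to pass the real client address through, along these lines in the server block that proxies to the container (a sketch of the usual reverse-proxy headers – the upstream address here is just whatever your container actually exposes):

```nginx
# Host nginx server block that proxies to the Discourse container
location / {
    proxy_pass http://127.0.0.1:8080;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
}
```

If those headers aren’t set, every request reaches the container’s nginx as coming from the host itself (and the container side may also need to be told to trust that header, depending on how it’s set up), so the rate-limiting template will happily 429 everyone at once the moment traffic picks up.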
OK, will test if it goes down again tomorrow despite the increased swap. Any other ideas if this turns out not to be it? The exact timing still makes me quite skeptical.
Yeah, the strict periodicity (in the absence of scheduled backups) is a very confusing aspect of this. It seems like a heck of a big clue: if you can figure out what’s running at that exact time, I think you’ll be 90% done.
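If you’d rather not be sitting at a terminal at 02:47, a throwaway cron entry that snapshots process and memory state around the window would catch it in the act. Purely illustrative – the file name and log path are placeholders, and note that cron runs in the server’s local time zone, which may not be PST:

```bash
# /etc/cron.d/catch-248 -- placeholder name; snapshots process and memory state
# every minute from 02:45 to 02:55 local time
45-55 2 * * * root { date; free -m; ps aux | sort -nrk3 | head -n 20; } >> /var/log/catch-248.log 2>&1
```

atop or sysstat would give you the same thing with continuous history, if you’d rather install a package.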
And of course it seems not to have happened today after occurring several days in a row at this exact time. Huh. Swap usage hasn’t ticked over 900M either. Oh well! Stay tuned, I guess.