An existing site monitored by the /srv/status endpoint today had a free space issue (i.e. there was none).
The monitoring endpoint /srv/status continued to return “200 OK” whilst a Homepage request / normal user requests returned “500 Internal Server Error”.
Feature request
It would be good if the monitoring endpoint could also run Redis checks that would capture the failure case where errors in services like Redis writing to disk would error.
To be clear here, I'm not asking for a free space check.
Really I'm expecting if the homepage and other user requests are all failing that the monitoring endpoint `/srv/status` should also fail.
Possible Workaround
Monitor both homepage and /srv/status monitoring endpoint?
Further detail
Looking at /logs/railed/production.log:
/srv/status requests look like this returning “200 OK”:
Started GET "/srv/status" for 85.x.x.xx at 2017-04-23 17:53:35 +0000
Processing by ForumsController#status as HTML
Rendered text template (0.0ms)
Completed 200 OK in 3ms (Views: 1.2ms | ActiveRecord: 0.0ms)
Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
Whilst homepage requests return “500 Internal Server Error”:
Started GET "/" for 90.x.x.x at 2017-04-23 17:53:00 +0000
Processing by ListController#latest as HTML
Completed 500 Internal Server Error in 214ms (ActiveRecord: 56.5ms)
Redis::CommandError (MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.)
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.3/lib/redis/client.rb:121:in `call'
Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
There are better tools out there for monitoring system-level things like disk space (and memory, and CPU, and disk I/O, and…). Reimplementing all of that logic inside Discourse seems like a bad idea.
Sure, we can (and should) implement generic “is external service X available to use?” code in /srv/status. The situation is greatly complicated by the master/slave fallback code, though I’m sure with enough head-scratching it could be figured out.
What I’m not in favour of is reimplementing a system-level monitoring system inside Discourse (let alone putting it in /srv/status). If you’re running a production system, then you should have a real monitoring system in place, and if you don’t, you’re going to have a bad time.
Thing is /srv/statusIS making sure the app can talk to redis, it indeed can, but it can not write to redis. Forcing it now to do extra work to ensure you can also write to redis is increasing the scope here.
In quite a few cases we NEED /srv/status to return - everything is OK - even if redis is in readonly.
I am open to adding a param here if you want to do a PR
I am uncertain if we call $redis.set('xyz', 1) in our middleware pipeline, its quite possible we do I would just have to trace it down. Pretty certain we call $redis.get I absolutely will not risk any changes of behavior here. For one our haproxy load balancer decides which servers are in rotation based on the reply here.
If for example I mess with this, and then deploy, and then a site goes readonly (cause it falls back to redis slave that can not write to disk)… well … maybe the whole site will vanish and show an error page, which kind of defeats a whole bunch of redundancy we have built in to our hosting.
I am not fussing with this, if anyone wants a more advanced srv status, go for it, but it should be optional and not change the default behavior.
Taking the discussion on a bit of a tangent, if people are feeling maxtreme YOLO, they could run Redis with stop-writes-on-bgsave-error no. Unfortunately, there’s a whole pile of stuff we store in Redis that makes a mess if it gets lost, like the auth token secret, and sidekiq queues, so not reliably persisting Redis data is… kinda risky. But, then again, we’re not using AOF, so any number of changes could be lost anyway (not too bad for the auth token secret, because it shouldn’t change often, but sidekiq jobs… yikes).