`/srv/status` monitoring endpoint doesn't catch some service unavailability issues - one example free space

Situation

An existing site monitored by the /srv/status endpoint today had a free space issue (i.e. there was none).

The monitoring endpoint /srv/status continued to return “200 OK” whilst a Homepage request / normal user requests returned “500 Internal Server Error”.

Feature request

It would be good if the monitoring endpoint could also run Redis checks that would capture the failure case where errors in services like Redis writing to disk would error.

To be clear here, I'm not asking for a free space check. Really I'm expecting if the homepage and other user requests are all failing that the monitoring endpoint `/srv/status` should also fail.

Possible Workaround

Monitor both homepage and /srv/status monitoring endpoint?


Further detail

Looking at /logs/railed/production.log:

/srv/status requests look like this returning “200 OK”:

Started GET "/srv/status" for 85.x.x.xx at 2017-04-23 17:53:35 +0000
Processing by ForumsController#status as HTML
  Rendered text template (0.0ms)
Completed 200 OK in 3ms (Views: 1.2ms | ActiveRecord: 0.0ms)
Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Whilst homepage requests return “500 Internal Server Error”:

Started GET "/" for 90.x.x.x at 2017-04-23 17:53:00 +0000
Processing by ListController#latest as HTML
Completed 500 Internal Server Error in 214ms (ActiveRecord: 56.5ms)
Redis::CommandError (MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.)
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.3/lib/redis/client.rb:121:in `call'
Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

2 Likes

There are better tools out there for monitoring system-level things like disk space (and memory, and CPU, and disk I/O, and…). Reimplementing all of that logic inside Discourse seems like a bad idea.

1 Like

It’s merely that redis and related services are fully functional.

I.e. I would also expect this check to fail if the homepage and other user requests were failing.

I don’t believe there should be an free space check.

Fixed?

EDIT:
Clarification point also added to original post.

I dunno @mpalmer I agree that if the homepage is 500-ing /srv/status ought to catch it, if we can do it by checking Redis et al.

1 Like

Sure, we can (and should) implement generic “is external service X available to use?” code in /srv/status. The situation is greatly complicated by the master/slave fallback code, though I’m sure with enough head-scratching it could be figured out.

What I’m not in favour of is reimplementing a system-level monitoring system inside Discourse (let alone putting it in /srv/status). If you’re running a production system, then you should have a real monitoring system in place, and if you don’t, you’re going to have a bad time.

10 Likes

Thing is /srv/status IS making sure the app can talk to redis, it indeed can, but it can not write to redis. Forcing it now to do extra work to ensure you can also write to redis is increasing the scope here.

In quite a few cases we NEED /srv/status to return - everything is OK - even if redis is in readonly.

I am open to adding a param here if you want to do a PR

/srv/status?super-maximum-check-mode=true

In particular:

/srv/status?full-check could return JSON

{
   "redis-can-read": true,
   "redis-can-write": true,
   "sql-can-read": true,
   "sql-can-write": true,
   "disk-space-ultra-low": true
}
2 Likes

When @Sam and I agree on something, you know it’s srs bznz…

4 Likes

Wait wait wait wait wait wait.

Why is “writing to redis” out of scope? That’d be like saying “confirming that we can write to the database is out of scope!”

Do you mean writing to disk? That I can agree with.

I am uncertain if we call $redis.set('xyz', 1) in our middleware pipeline, its quite possible we do I would just have to trace it down. Pretty certain we call $redis.get I absolutely will not risk any changes of behavior here. For one our haproxy load balancer decides which servers are in rotation based on the reply here.

If for example I mess with this, and then deploy, and then a site goes readonly (cause it falls back to redis slave that can not write to disk)… well … maybe the whole site will vanish and show an error page, which kind of defeats a whole bunch of redundancy we have built in to our hosting.

I am not fussing with this, if anyone wants a more advanced srv status, go for it, but it should be optional and not change the default behavior.

So the assumption is it is OK for all Redis writes to fail, the app should continue working, no big deal, couldn’t write to Redis, don’t care?

Not talking about Redis writing its in memory key-value pairs to disk… I’m talking about writing a new key in memory, to Redis.

the operation $redis.set('xyz', 1) returns a particular exception when you are a slave, regardless of disk space.

I am just SUPER EXTRA DUPER nervous about mucking with this cause it can pretty much take our entire hosting down.

I think it’s a legit complaint, but not worth that risk, for sure.

Taking the discussion on a bit of a tangent, if people are feeling maxtreme YOLO, they could run Redis with stop-writes-on-bgsave-error no. Unfortunately, there’s a whole pile of stuff we store in Redis that makes a mess if it gets lost, like the auth token secret, and sidekiq queues, so not reliably persisting Redis data is… kinda risky. But, then again, we’re not using AOF, so any number of changes could be lost anyway (not too bad for the auth token secret, because it shouldn’t change often, but sidekiq jobs… yikes).

2 Likes