`/srv/status` monitoring endpoint doesn't catch some service unavailability issues - one example free space

DeanMarkTaylor · April 23, 2017, 7:12pm

Situation

An existing site monitored by the /srv/status endpoint today had a free space issue (i.e. there was none).

The monitoring endpoint /srv/status continued to return “200 OK” whilst a Homepage request / normal user requests returned “500 Internal Server Error”.

Feature request

It would be good if the monitoring endpoint could also run Redis checks that would capture the failure case where errors in services like Redis writing to disk would error.

To be clear here, I'm not asking for a free space check. Really I'm expecting if the homepage and other user requests are all failing that the monitoring endpoint `/srv/status` should also fail.

Possible Workaround

Monitor both homepage and /srv/status monitoring endpoint?

Further detail

Looking at /logs/railed/production.log:

/srv/status requests look like this returning “200 OK”:

Started GET "/srv/status" for 85.x.x.xx at 2017-04-23 17:53:35 +0000
Processing by ForumsController#status as HTML
  Rendered text template (0.0ms)
Completed 200 OK in 3ms (Views: 1.2ms | ActiveRecord: 0.0ms)
Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Whilst homepage requests return “500 Internal Server Error”:

Started GET "/" for 90.x.x.x at 2017-04-23 17:53:00 +0000
Processing by ListController#latest as HTML
Completed 500 Internal Server Error in 214ms (ActiveRecord: 56.5ms)
Redis::CommandError (MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.)
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.3/lib/redis/client.rb:121:in `call'
Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

mpalmer · April 23, 2017, 9:32pm

There are better tools out there for monitoring system-level things like disk space (and memory, and CPU, and disk I/O, and…). Reimplementing all of that logic inside Discourse seems like a bad idea.

DeanMarkTaylor · April 23, 2017, 9:41pm

It’s merely that redis and related services are fully functional.

I.e. I would also expect this check to fail if the homepage and other user requests were failing.

I don’t believe there should be an free space check.

mpalmer · April 23, 2017, 9:45pm

DeanMarkTaylor · April 23, 2017, 9:53pm

Fixed?

EDIT:
Clarification point also added to original post.

codinghorror · April 24, 2017, 3:23am

I dunno @mpalmer I agree that if the homepage is 500-ing /srv/status ought to catch it, if we can do it by checking Redis et al.

mpalmer · April 24, 2017, 3:31am

Sure, we can (and should) implement generic “is external service X available to use?” code in /srv/status. The situation is greatly complicated by the master/slave fallback code, though I’m sure with enough head-scratching it could be figured out.

What I’m not in favour of is reimplementing a system-level monitoring system inside Discourse (let alone putting it in /srv/status). If you’re running a production system, then you should have a real monitoring system in place, and if you don’t, you’re going to have a bad time.

sam · April 25, 2017, 10:09pm

Thing is /srv/status IS making sure the app can talk to redis, it indeed can, but it can not write to redis. Forcing it now to do extra work to ensure you can also write to redis is increasing the scope here.

In quite a few cases we NEED /srv/status to return - everything is OK - even if redis is in readonly.

I am open to adding a param here if you want to do a PR

/srv/status?super-maximum-check-mode=true

In particular:

/srv/status?full-check could return JSON

{
   "redis-can-read": true,
   "redis-can-write": true,
   "sql-can-read": true,
   "sql-can-write": true,
   "disk-space-ultra-low": true
}

mpalmer · April 25, 2017, 10:10pm

When @Sam and I agree on something, you know it’s srs bznz…

codinghorror · April 25, 2017, 10:15pm

Wait wait wait wait wait wait.

Why is “writing to redis” out of scope? That’d be like saying “confirming that we can write to the database is out of scope!”

Do you mean writing to disk? That I can agree with.

sam · April 25, 2017, 10:20pm

I am uncertain if we call $redis.set('xyz', 1) in our middleware pipeline, its quite possible we do I would just have to trace it down. Pretty certain we call $redis.get I absolutely will not risk any changes of behavior here. For one our haproxy load balancer decides which servers are in rotation based on the reply here.

If for example I mess with this, and then deploy, and then a site goes readonly (cause it falls back to redis slave that can not write to disk)… well … maybe the whole site will vanish and show an error page, which kind of defeats a whole bunch of redundancy we have built in to our hosting.

I am not fussing with this, if anyone wants a more advanced srv status, go for it, but it should be optional and not change the default behavior.

codinghorror · April 25, 2017, 10:21pm

So the assumption is it is OK for all Redis writes to fail, the app should continue working, no big deal, couldn’t write to Redis, don’t care?

Not talking about Redis writing its in memory key-value pairs to disk… I’m talking about writing a new key in memory, to Redis.

sam · April 25, 2017, 10:23pm

the operation $redis.set('xyz', 1) returns a particular exception when you are a slave, regardless of disk space.

I am just SUPER EXTRA DUPER nervous about mucking with this cause it can pretty much take our entire hosting down.

codinghorror · April 25, 2017, 10:27pm

I think it’s a legit complaint, but not worth that risk, for sure.

mpalmer · April 26, 2017, 2:18am

Taking the discussion on a bit of a tangent, if people are feeling maxtreme YOLO, they could run Redis with stop-writes-on-bgsave-error no. Unfortunately, there’s a whole pile of stuff we store in Redis that makes a mess if it gets lost, like the auth token secret, and sidekiq queues, so not reliably persisting Redis data is… kinda risky. But, then again, we’re not using AOF, so any number of changes could be lost anyway (not too bad for the auth token secret, because it shouldn’t change often, but sidekiq jobs… yikes).

Topic		Replies	Views
`/srv/status` returns OK even if database is broken Dev	6	652	July 18, 2020
Health check API Feature	9	2405	January 28, 2020
How to test /srv/status Support	1	691	March 17, 2021
What URL should we monitor to be sure Discourse is up Support	3	1534	April 25, 2016
Webhook for Discourse Uptime Monitoring? Dev	9	1713	December 12, 2017

`/srv/status` monitoring endpoint doesn't catch some service unavailability issues - one example free space

Situation

Feature request

Possible Workaround

Further detail

Related topics