`/srv/status` returns OK even if database is broken

I had an upgrade on a two-container install go sideways this morning. Bootstrapping a new container migrated the database into a broken state (perhaps it wouldn’t have if I’d used SKIP_POST_DEPLOYMENT_MIGRATIONS=1, but that’s another issue), so the still-running container showed the “oops, this site is broken” message.

That much is expected, but my monitor checks /srv/status, which merrily returns OK even when Rails is pretty broken.

Is this a bug? I really want my monitors to know when there is a problem. Should I instead be polling something else, like /about.json (for sites that don’t require login)?
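For what it’s worth, a monitor along those lines could treat the site as up only when /about.json both returns 200 and parses as JSON. This is a hypothetical sketch, not an official Discourse API; the helper name and the "about" key check are my assumptions:

```ruby
require "json"

# Hypothetical monitor predicate: the site counts as healthy only if
# /about.json came back with HTTP 200 AND the body is valid JSON containing
# an "about" key. A broken Rails app typically renders an HTML error page
# instead, which fails the JSON parse.
def about_json_healthy?(status_code, body)
  status_code == 200 && JSON.parse(body).key?("about")
rescue JSON::ParserError, TypeError
  false
end
```

Fetching is left to whatever the monitor already uses (curl, Net::HTTP, etc.); the point is to inspect the body, not just the status line.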

1 Like

How exactly was it broken? Can you be more specific? What did the homepage look like?

Yeah, “ok” means “unicorn is working”. You can bring Postgres and Redis down and it still says “ok” if I remember correctly.

3 Likes

I’m fairly certain that’s correct. It makes sense, it’s just not what I thought.

1 Like

@sam, @eviltrout, do you know what the status of this is? I vaguely remember a discussion of this in the past.

1 Like

Yup, it doesn’t check Redis or PG. I think we use a plugin that does a User.find(1) and a $redis.get instead. That still wouldn’t catch @pfaffman’s case, but that might be a bit too much; you can’t expect this endpoint to do a complete database consistency check.
https://github.com/discourse/discourse/blob/master/app/controllers/forums_controller.rb#L11-L17
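A dependency-aware check along those lines might look roughly like this. It is a sketch under assumptions, not the actual plugin; the probe lambdas are self-contained stand-ins for the real User.find(1) and $redis.get calls:

```ruby
# Sketch of a dependency-aware status check (assumed design, not the real
# plugin): each dependency is probed by a callable, and any exception marks
# that dependency as failed.
def readiness_status(checks)
  results = checks.transform_values do |probe|
    probe.call
    "ok"
  rescue StandardError => e
    "error: #{e.class}"
  end
  [results.values.all?("ok"), results]
end

# In a Rails context the probes would be something like
#   db:    -> { User.find(1) }           # or Discourse.system_user.id
#   redis: -> { $redis.get("health") }
# Here they are faked so the sketch runs standalone:
healthy, detail = readiness_status(
  db:    -> { raise "connection refused" },
  redis: -> { "PONG" }
)
```

Returning per-dependency detail (rather than a bare “ok”) also makes the failure visible in the monitor itself.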

2 Likes

The /srv/status endpoint checks only the local process, not any dependencies. It answers “is the HTTP stack wedged?” and “am I in lame-duck mode?”. In Kubernetes terminology, this is the livenessProbe, not the readinessProbe.

If we want to introduce a readinessProbe, it should live at a different URL.
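For illustration, here is how that split would look in a Kubernetes pod spec. The /srv/ready path is a made-up placeholder for whichever URL such an endpoint would get (it is not an existing Discourse route), and the port is likewise a placeholder:

```yaml
containers:
  - name: discourse
    # livenessProbe: "is this process wedged?" -- maps to today's /srv/status
    livenessProbe:
      httpGet:
        path: /srv/status
        port: 3000
      periodSeconds: 10
    # readinessProbe: "can I actually serve traffic?" -- would need the new,
    # dependency-checking endpoint discussed above (placeholder path)
    readinessProbe:
      httpGet:
        path: /srv/ready
        port: 3000
      periodSeconds: 10
```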

Probably Discourse.system_user.id instead of 1.

5 Likes