We recently encountered a problem on our Discourse where incoming emails were not being received, probably due to an authentication problem with our email service. Unfortunately, this went on for an extended period (possibly weeks?) without anyone’s knowledge, since the users who post and reply via email are generally not on the forum to SEE whether their emails are actually being posted (and, in our case, outgoing emails still worked fine). Effectively, we had a chunk of users who were totally silenced without anyone noticing. This is very, very bad.
The meta-problem here is that I only discovered the problem because one user happened to notice that a post of theirs was missing, and I then dug through the logs for 30 minutes until I found the culprit. That turnaround time is obviously not acceptable. Is there a mechanism I’m not aware of that would have flagged this problem immediately for admins? I would expect repeated failures in a core service like email (especially non-temporary failures like an auth failure…) to raise a visible flag SOMEWHERE so that they could be investigated. Are there other good strategies for keeping tabs on these kinds of issues?