Strategies for identifying and triaging email / other critical errors?

We recently ran into a problem on our Discourse where incoming emails were not being received, probably because of an authentication problem with our email service. Unfortunately, this went on for an extended period (possibly weeks) without anyone knowing, because the users who post and reply via email are generally not on the forum to see whether their emails are actually being posted (and, in our case, outgoing emails still worked fine). Effectively, a chunk of our users was silenced without anyone noticing, which is very, very bad. :slight_smile:

The meta-problem here is that I only discovered the issue because one user happened to notice that a post of theirs was missing, and I then dug through the logs for 30 minutes until I found the culprit. That turnaround time is obviously not acceptable. Is there a mechanism I’m not aware of that would have flagged this problem immediately for admins? I would expect repeated failures of a core service like email (especially non-temporary failures such as an auth failure…) to raise a visible flag somewhere so that they could be investigated. Are there other good strategies for keeping tabs on these kinds of issues?


If you were using POP3 and Discourse failed to connect to your POP3 server, a warning would have been shown in the admin dashboard.

If you were not using POP3, please describe how you are delivering emails and what kind of authentication problem you suspect.
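
Independent of how the mail is delivered, one generic way to catch a silent failure like this is a staleness check: alert when no incoming email has been processed for longer than you would normally expect. Here is a minimal sketch in Python. It is not a Discourse feature or API; the function name, the 24-hour threshold, and the assumption that you can get the timestamp of the last received email from somewhere (logs, a database query, your mail provider) are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def incoming_mail_is_stale(
    last_received: Optional[datetime],
    now: datetime,
    max_silence: timedelta = timedelta(hours=24),  # hypothetical threshold
) -> bool:
    """Return True when no incoming email was seen within max_silence.

    last_received is the timestamp of the most recent incoming email,
    or None if none has ever been recorded (treated as stale).
    """
    if last_received is None:
        return True
    return now - last_received > max_silence


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical value: in practice this would come from your own logs or mail provider.
    last_seen = now - timedelta(days=3)
    if incoming_mail_is_stale(last_seen, now):
        print("WARNING: no incoming email processed recently, check mail delivery")
```

Run from a scheduled job, a check like this would flag a prolonged silence within a day or so rather than weeks, though the threshold needs tuning to how busy the forum's reply-by-email traffic usually is.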
