Don't use sidekiq queue to process "problems have been found" email

We’ve been seeing some random sidekiq crashes recently, hopefully fixed by https://github.com/discourse/discourse/commit/f662d1135e0dced66b2c5550e59e3fd958d18854

One problem is there was nothing obvious that told us this had happened. We’re working on improving our monitoring setup so someone gets a phone call when this happens, but I expect most communities don’t/won’t have that sort of support.

After restarting sidekiq the most recent time, a “problems have been found” PM came through alerting me that sidekiq was down. The problem is that it wasn’t sent while sidekiq was down, because the PM itself was stuck in the sidekiq queue!

As it happens, the other times sidekiq went down and I brought it back up, there was no such PM. I’m not sure why that was. It’s possible this PM was instead alerting me that an update check hadn’t been done, because the queue was so saturated when sidekiq came back up after being down for a weekend.

If an email were sent synchronously to the DISCOURSE_DEVELOPER email, or the contact email, or whatever makes most sense, as soon as problems are detected, alerting would be much more resilient on sites that don’t have the time to build a fully-fledged monitoring setup.
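To illustrate why the delivery path matters, here is a minimal sketch in plain Ruby. It is not Discourse or Sidekiq code: `FakeQueue` and `send_alert` are hypothetical stand-ins, assuming only that a queued job does nothing until a worker picks it up.

```ruby
# Hypothetical stand-in for a background job queue (not the Sidekiq API).
class FakeQueue
  def initialize
    @jobs = []
  end

  # Enqueuing only records the job; nothing runs until a worker drains it.
  def enqueue(&job)
    @jobs << job
  end

  # Called by a worker process. If the worker is down, this never happens.
  def drain
    @jobs.shift.call until @jobs.empty?
  end

  def size
    @jobs.size
  end
end

delivered = []
# Hypothetical delivery function; a real one would send an email.
send_alert = ->(message) { delivered << message }

queue = FakeQueue.new

# Queued path: the alert waits for a worker that is not running.
queue.enqueue { send_alert.call("queued: sidekiq is down") }

# Inline path: the alert goes out in the process that detected the problem.
send_alert.call("inline: sidekiq is down")

puts delivered.inspect # only the inline alert has been delivered so far
queue.drain            # simulates the worker coming back after a restart
puts delivered.inspect # the queued alert only arrives now
```

The queued alert is delivered only after `drain` runs, which matches what I saw: the PM arrived after the restart, not during the outage.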

…unless of course SMTP is also not working, so perhaps as well as this, there should be a synchronously triggered webhook for those instances that do have some sort of third-party monitoring set up.
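A rough sketch of that idea, again in plain Ruby rather than anything taken from Discourse: every channel is tried inline, and a failure in one (say, SMTP being down) does not stop the others. `post_webhook`, `alert_synchronously`, and the channel names are hypothetical and exist only for illustration.

```ruby
require "net/http"
require "uri"
require "json"

# Hypothetical helper: POSTs the problem text as JSON to a monitoring webhook.
# Returns true only on a 2xx response.
def post_webhook(url, message)
  response = Net::HTTP.post(
    URI.parse(url),
    JSON.generate({ text: message }),
    "Content-Type" => "application/json"
  )
  response.is_a?(Net::HTTPSuccess)
end

# Tries every channel inline, in order, and returns the names of the channels
# that reported success. An exception in one channel is logged and skipped.
def alert_synchronously(message, channels)
  channels.each_with_object([]) do |(name, deliver), succeeded|
    begin
      succeeded << name if deliver.call(message)
    rescue StandardError => e
      warn "alert channel #{name} failed: #{e.class}: #{e.message}"
    end
  end
end

# Demonstration with stubs so that nothing touches the network.
# A real webhook channel might be:
#   ->(msg) { post_webhook("https://monitoring.example.com/hook", msg) }
channels = {
  email:   ->(_msg) { raise IOError, "SMTP unavailable" }, # simulated outage
  webhook: ->(_msg) { true }                               # simulated success
}

puts alert_synchronously("sidekiq is down", channels).inspect
```

With the simulated SMTP outage, only the webhook channel ends up in the returned list, so an alert still gets out.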

I am not entirely following – doesn’t sidekiq automatically restart after crashing, regardless? We didn’t see any epidemic problems on our hosting, or in supporting self installs.

It hasn’t been restarting for us, which may point to it being another problem. We’ve been having to restart our containers to bring it back up. We’re still investigating.

I think this feature request stands alone, however. Taking as many things as possible that might go wrong out of the path from problem to alert seems like a good idea to me.