We’ve been seeing some random sidekiq crashes recently, hopefully fixed by https://github.com/discourse/discourse/commit/f662d1135e0dced66b2c5550e59e3fd958d18854
One problem is there was nothing obvious that told us this had happened. We’re working on improving our monitoring setup so someone gets a phone call when this happens, but I expect most communities don’t/won’t have that sort of support.
After the most recent sidekiq restart, a “problems have been found” PM came through alerting me that sidekiq was down. The problem is that it wasn’t sent while sidekiq was down, because the PM itself got stuck in the queue!
As it happens, the other times sidekiq went down and I brought it back up, there was no such PM. I’m not sure why. It’s possible this PM was instead alerting me that an update check hadn’t been done: sidekiq had been down for a weekend, so the queue was saturated once it came back up.
If an email were sent synchronously to the DISCOURSE_DEVELOPER email, the contact email, or whatever makes most sense, as soon as problems are detected, alerting would be much more resilient on sites that don’t have the time to build a fully-fledged monitoring setup.
…unless of course SMTP is also not working. So perhaps, as well as the email, there should be a synchronously triggered webhook for instances that do have some sort of third-party monitoring set up.
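To make the idea concrete, here is a rough sketch of what I mean by “synchronous”: the alert is delivered in the process that detected the problem, never via the job queue, and each channel is tried independently so a dead SMTP server doesn’t stop the webhook from firing. This is Python purely for illustration and is not how Discourse does anything today; the function names, hosts, addresses, and webhook URL are all placeholders I made up.

```python
import json
import smtplib
import urllib.request
from email.message import EmailMessage


def send_email(message, host="localhost", port=25,
               sender="discourse@example.com",      # placeholder address
               recipient="admin@example.com",       # placeholder address
               timeout=10):
    """Deliver the alert over SMTP in the calling process (no job queue)."""
    mail = EmailMessage()
    mail["Subject"] = "Problems have been found"
    mail["From"] = sender
    mail["To"] = recipient
    mail.set_content(message)
    with smtplib.SMTP(host, port, timeout=timeout) as smtp:
        smtp.send_message(mail)


def post_webhook(message,
                 url="https://monitoring.example.com/hook",  # placeholder URL
                 timeout=10):
    """POST the alert as JSON to a third-party monitoring endpoint."""
    body = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()


def alert_synchronously(message, channels):
    """Try every channel in turn, without enqueueing anything.

    Returns {channel name: None on success, or a string describing the error}.
    A failure in one channel never prevents the next one from being tried.
    """
    results = {}
    for name, deliver in channels.items():
        try:
            deliver(message)
            results[name] = None
        except Exception as error:
            results[name] = repr(error)
    return results
```

The problem check would then call something like `alert_synchronously("Sidekiq is not running", {"email": send_email, "webhook": post_webhook})`. The timeouts matter here: since nothing is queued, a hung SMTP server would otherwise block whatever process is running the check.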