Email Hostname Certificate Mismatch Causing sidekiq Queue Overload, Severe Site Instability

Nice job @RGJ!

While we anticipate a fix, on a side note, it would be good if this problem didn’t cause the cascade of issues that I experienced, which nearly brought my forum down completely. Specifically:

  • The email failures seem to be retried extremely quickly, which causes the sidekiq queue to explode in size and drives CPU usage to ~100%
  • In addition, something (either crashes or restarts) was causing Redis to write enormous tmp files, I assume containing the state of the sidekiq queue. While these were safe to remove, they quickly filled the disk, which caused more crashes, and so on. I had some other disk space that I was able to free so that I could restart the forum and figure out what was going on, but this may not be true for everyone. (It’s also somewhat hard to confirm that, in this case, the Redis tmp files are in fact safe to delete.)

My guess is that the simplest solution here is to slow down the retry on failed email jobs, or at least on the ones that, unlike password resets, have no timeliness constraints. That seems appropriate given that email problems are unlikely to resolve quickly, and most / all mailers will do their own retries once they receive a message.
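
To make the suggestion a bit more concrete, here is a rough sketch of the kind of retry schedule I have in mind. It is written in Python purely for illustration and is not Discourse’s or sidekiq’s actual code; the function name, the email type names, and all of the delay values are assumptions I made up for the example.

```python
import random

# Hypothetical names for emails a user is actively waiting on.
TIME_SENSITIVE = {"password_reset", "email_login", "signup_confirmation"}


def retry_delay_seconds(email_type, attempt, base=60, cap=6 * 3600, jitter=0.0):
    """Return how long to wait before retry number `attempt` (1-based).

    All numbers here are illustrative assumptions, not real defaults.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")

    if email_type in TIME_SENSITIVE:
        # Stay fairly aggressive: 10s, 20s, 40s, ... capped at 5 minutes.
        delay = min(10 * 2 ** (attempt - 1), 300)
    else:
        # Back off quickly: 1 min, 4 min, 16 min, ... capped at 6 hours.
        delay = min(base * 4 ** (attempt - 1), cap)

    # Optional jitter (as a fraction of the delay) spreads retries out so
    # that a large batch of failed jobs does not all wake up at once.
    return delay * (1 + random.uniform(0, jitter))


if __name__ == "__main__":
    for attempt in range(1, 7):
        print(attempt,
              retry_delay_seconds("digest", attempt),
              retry_delay_seconds("password_reset", attempt))
```

With a schedule along these lines, a broken mail server would produce a handful of retries per job per hour rather than a constant stream, which should keep the queue (and anything Redis persists about it) from growing without bound.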
