Email Hostname Certificate Mismatch Causing sidekiq Queue Overload, Severe Site Instability

Nice job @RGJ!

While we wait for a fix: as a side note, it would be good if this problem didn’t trigger the cascade of issues that I experienced, which nearly brought my forum down completely. Specifically:

  • The email failures seem to be retried extremely quickly, which causes the sidekiq queue to explode in size and these jobs to consume ~100% of the CPU.
  • In addition, something (either crashes or restarts) was causing Redis to write enormous tmp files, which I assume contain the state of the sidekiq queue. While these were safe to remove, they quickly filled the disk, which caused more crashes, and so on. I had some other disk space that I was able to free so that I could restart the forum and figure out what was going on, but this may not be true for everyone. (It’s also somewhat hard to confirm that, in this case, the Redis tmp files are in fact safe to delete.)

My guess is that the simplest solution here is to slow down the retry on failed email jobs, or at least on ones that, unlike password resets, don’t have timeliness constraints. That seems appropriate given that email problems are unlikely to resolve quickly, and most or all mailers will do their own retries once they receive a message.
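To make the suggestion concrete, here is a minimal sketch of a slower retry schedule: exponential backoff with a cap and a little jitter. It is not Sidekiq’s or Discourse’s actual retry logic. The function names, the 60-second base and the 6-hour cap are all made up for illustration.

```python
import random


def retry_delay_seconds(attempt: int, base: float = 60.0,
                        cap: float = 6 * 3600.0, jitter: float = 0.1) -> float:
    """Delay before retry number `attempt` (0-based).

    Hypothetical policy: double the wait on every failure, never exceed `cap`,
    and add up to `jitter` (a fraction of the delay) of random extra time.
    """
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    delay = min(cap, base * (2 ** attempt))
    # Jitter spreads retries out so a burst of failed jobs doesn't retry in lockstep.
    return delay + random.uniform(0, jitter * delay)


def retry_schedule(max_attempts: int, **kwargs) -> list:
    """Delays for retries 0 .. max_attempts - 1 under the policy above."""
    return [retry_delay_seconds(n, **kwargs) for n in range(max_attempts)]


# Print the schedule without jitter so the growth is easy to read.
for n, delay in enumerate(retry_schedule(8, jitter=0.0)):
    print(f"retry {n}: wait {delay:.0f}s")
```

With a schedule like this, a mail server that stays broken for an hour produces a handful of retries per message instead of a constantly refilling queue. Time-sensitive mail such as password resets could use a smaller base.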

8 Likes

In my case, when I hit the failure after the upgrade, Discourse was using TLS with a third-party server and the name on the certificate matched the SMTP server name. I only had one failure, however. I haven’t rebooted or upgraded since, to avoid further issues. I’ll try an update once the patch has been released and see how it goes.

2 Likes

I’ll start by creating a topic in #bug, but since it’s technically a problem in an upstream gem, I am not sure how much priority this is going to get.

3 Likes

+1 :worried: really frustrating bug

1 Like

Can’t the gem be rolled back? I would be surprised if it didn’t get attention, since this is “core” functionality (the ability to send emails), and for some it’s also causing an outage because temporary files and CPU usage are overwhelming the server. The core stability of the forum is being disrupted here.

2 Likes

Please don’t forget that this can easily be resolved by configuring your mail server properly as well.

1 Like

AFAIK my server is configured properly: the certificate name matches the SMTP host name, and it uses STARTTLS on port 587. So I’m wondering why I hit the issue.

Thanks for opening a new ticket. Given your understanding, could you shed some light on the combinations of the two variables you pointed out in the YML file - how should they be used for different scenarios?

DISCOURSE_SMTP_ENABLE_START_TLS: true
DISCOURSE_SMTP_OPENSSL_VERIFY_MODE: none

For example, I only have STARTTLS on port 587; for security reasons, no other ports are used for SMTP. Should both the variables be specified in the YML file or just one?
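As a rough illustration of what the two settings in the question are meant to control, here is a Python sketch of the same two knobs: whether to upgrade the connection with STARTTLS, and whether to verify the server certificate. This is an analogy using Python’s `smtplib` and `ssl`, not how Discourse implements them. `build_tls_context` and `probe_smtp` are hypothetical helper names, and the only modes assumed are `peer` and `none`.

```python
import smtplib
import ssl


def build_tls_context(verify_mode: str = "peer") -> ssl.SSLContext:
    """Return a TLS context for 'peer' (verify chain and hostname) or 'none'."""
    ctx = ssl.create_default_context()  # defaults: CERT_REQUIRED + hostname check
    if verify_mode == "none":
        # check_hostname has to be switched off before CERT_NONE is accepted.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    elif verify_mode != "peer":
        raise ValueError(f"unsupported verify mode: {verify_mode!r}")
    return ctx


def probe_smtp(host: str, port: int = 587, use_starttls: bool = True,
               verify_mode: str = "peer", timeout: float = 10.0) -> bool:
    """Connect, optionally upgrade with STARTTLS, and send a NOOP.

    Raises ssl.SSLCertVerificationError if verification is on and the
    certificate does not match `host` or is not trusted.
    """
    with smtplib.SMTP(host, port, timeout=timeout) as server:
        server.ehlo()
        if use_starttls:
            server.starttls(context=build_tls_context(verify_mode))
            server.ehlo()  # re-identify over the encrypted channel
        code, _ = server.noop()
        return code == 250
```

Running `probe_smtp("your.smtp.host")` with the defaults against a server is one way to see whether a strict client accepts its certificate. If that only succeeds with `verify_mode="none"`, the certificate or hostname is the likely culprit.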

2 Likes

It depends.
If your SMTP server is configured correctly then you shouldn’t need either of them.

But the problem right now is that they are not doing anything at all.

Send me a PM with the name of your SMTP server and I’ll take a look and see if I can find why it’s not working for you.

2 Likes

I have a local SMTP server with no TLS/SSL support and when using StartTls=false it does not work :frowning:

1 Like

Fair point, but it’s not always our mail server. I’m using an internal mail server that is maintained by another group, and so have no control over the certificate issues or the mail server configuration.

That said, for others struggling with this, one option may be to set up your own mail server on localhost and just have it forward mail onward. Then you have control over the mail server that Discourse talks to, and your mail server on localhost can be configured to deal with whatever kind of upstream issues you might encounter. I had done this previously, but removed it at some point since it was simpler to just have Discourse talk to the upstream mail server directly. (Oops.)

1 Like

That’s why the Standard Install recommends third-party mail providers, rather than trying to use an existing or self-hosted solution.

Mail is hard for a multitude of reasons. Just because something is working today doesn’t mean it’s correctly configured, only that the misconfiguration doesn’t affect the original purpose.

1 Like

The reason I picked Discourse was that it’s supposed to be easy to install and maintain for small self-hosted deployments.

1 Like

And it is if you follow the recommendations.

If you opt to take a different path it’s not really possible to make any guarantees.

1 Like

So in summary you are saying that with discourse it is no longer possible to use an SMTP server without TLS, SSL or StartTLS?

1 Like

I don’t think anyone is suggesting that. This only relates to how the issue came about, and why it took time to find a root cause.

The reason we’re only seeing a handful of cases here is because of the relatively small number of installs with the updated gem that also aren’t relaying mail over some form of secured transport.

Richard has already started a topic on the bug:

Anyone who needs this working sooner can also either enable TLS on their mail server, or temporarily switch to a mail provider that offers a secure transport.

1 Like

I have had TLS enabled with a valid certificate and matching hostname from the beginning, and I still ran into the issue after the BETA 4 (461936f211) upgrade. I posted the logs in the topic below. Another user is also having issues and, according to him, his certificates are in order too:

1 Like

That’s what it sounds like to me. Some of the comments in this thread have been downright infuriating.

I self-host Docker-Discourse, and I use my Docker host as the email server. Since the beginning, six years ago, I’ve had Discourse deliver email on port 25, with no TLS, via the internal Docker interface. This is a perfectly reasonable and perfectly safe configuration. The transactions were 100% internal to my own host. See further up-thread for my old configuration.

For me, the workaround was to do the following:

Add the host’s internal Docker interface IP address as a valid host in the DNS zone file for my domain. I.e.:

discourse-mail.jag-lovers.com A 172.17.0.1

Please note: I could just make up any hostname in the jag-lovers.com domain, since I use a wildcard certificate (CN = *.jag-lovers.com). If you don’t have a wildcard cert, be sure to use a hostname that’s a valid CN or SAN on your cert.
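To illustrate why any made-up hostname works with a wildcard cert while the bare IP address does not, here is a simplified sketch of wildcard name matching. It assumes the wildcard only ever stands for the whole left-most label. Real TLS libraries follow RFC 6125 and handle more cases, so treat `matches_cert_name` and `hostname_covered` as hypothetical helpers, not a replacement for proper verification.

```python
def matches_cert_name(hostname: str, pattern: str) -> bool:
    """True if `hostname` matches a certificate name such as '*.example.com'.

    Simplified rule: '*' may only be the entire left-most label and covers
    exactly one non-empty label; every other label must match exactly.
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    pat_labels = pattern.lower().rstrip(".").split(".")
    if len(host_labels) != len(pat_labels):
        return False
    if pat_labels[0] == "*":
        return bool(host_labels[0]) and host_labels[1:] == pat_labels[1:]
    return host_labels == pat_labels


def hostname_covered(hostname: str, cert_names: list) -> bool:
    """True if any CN/SAN entry in `cert_names` covers `hostname`."""
    return any(matches_cert_name(hostname, name) for name in cert_names)


names_on_cert = ["*.jag-lovers.com"]
print(hostname_covered("discourse-mail.jag-lovers.com", names_on_cert))
print(hostname_covered("172.17.0.1", names_on_cert))
```

Under this rule, the wildcard covers `discourse-mail.jag-lovers.com` but not the bare domain, a deeper subdomain, or the Docker interface IP. That is why the DNS record above is needed before pointing Discourse at the host.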

Please also note: The IP address I used above is the internal IP address that my host uses on the Docker interface, to talk to the Discourse-Docker container. It’s a private, non-routable IP address.

Next, change the Discourse app.yml configuration to connect to the hostname I just created, to use TLS, to connect on port 587, and to use SASL to log in to the host for each email transaction (because otherwise you’ll get a relaying denied error message).

  DISCOURSE_SMTP_ADDRESS: discourse-mail.jag-lovers.com
  DISCOURSE_SMTP_PORT: 587
  DISCOURSE_SMTP_USER_NAME: REDACTED
  DISCOURSE_SMTP_PASSWORD: "REDACTED"
  DISCOURSE_SMTP_ENABLE_START_TLS: true          # (optional, default true)
  DISCOURSE_SMTP_OPENSSL_VERIFY_MODE: none
  DISCOURSE_SMTP_DOMAIN: jag-lovers.com

Next, rebuild Discourse. That fixed the problem for me.

2 Likes

@RBoy, I appreciate the mention, but please don’t conflate the issues. I momentarily thought that I might have been affected by both issues, but was not. I am dealing only with properly-configured mail servers and am unaffected by this change.

2 Likes

I pointed it out since both issues started after the upgrade to beta4 (a Ruby upgrade, from what I can see), both are related to email, and there may be some correlation. It sounds like you’re still having trouble with the email jobs despite having a valid certificate.

If you are convinced there is no way they could be correlated, I’ll remove the reference to your post; just DM me.

1 Like

I don’t see a way they would be correlated, but I’m also not asking you to edit. :slightly_smiling_face:

A setup on which thousands of emails flow smoothly and at the same time 30 consistently fail doesn’t sound anything like a certificate handling change.

4 Likes