Email Hostname Certificate Mismatch Causing sidekiq Queue Overload, Severe Site Instability

I’ve been self-hosting Discourse for many years, and had several instances happily configured and running on a fairly beefy machine.

Today I noticed that one of my forums had gone down. The initial culprit looked to be lack of disk space, which I fixed. I then restarted the Discourse instance.

However, it has continued to go down regularly since then. Each time I boot it, I immediately see sidekiq go crazy with a huge number of failed email jobs. These are also causing redis to store a massive amount of state, which I think was the actual cause of the disk space problem. (I’m about to do a flush the next time I can bring the machine up, since if I don’t I’ll quickly run out of space on this machine and won’t even be able to start Discourse to flush it. But the flush doesn’t seem to reduce redis disk usage much.)

The error message indicates something regarding a certificate name mismatch, which I find a bit surprising since the mail server I’m using is internal and doesn’t require TLS or authentication. I was able to verify on one of my other instances (using the same email configuration) that email had stopped working. All I can see in the main production logs is a 422 error, but when I send something like a password reset I see a similar error in the sidekiq logs:

Jobs::HandledExceptionWrapper: Wrapped OpenSSL::SSL::SSLError: SSL_connect returned=1 errno=0 state=error: certificate verify failed (Hostname mismatch)

I have been able to verify that I can send email via the command line, so this does not seem to be a problem with the email server itself, just something broken with the Discourse configuration.

Here’s the original mail configuration that was working until recently:

DISCOURSE_SMTP_ADDRESS: outbound-relays.techservices.illinois.edu
DISCOURSE_SMTP_PORT: 25
DISCOURSE_SMTP_ENABLE_START_TLS: false

Again, this mailserver is internal and doesn’t require a username or password, and these settings were working until recently. I’ve been experimenting with DISCOURSE_SMTP_OPENSSL_VERIFY_MODE, but I can’t tell if it actually is still supported. Regardless, it doesn’t seem to help. I noticed a few new email settings that were added since I set up these forums, but they don’t seem needed given this mail server’s configuration.

Any help would be appreciated! At this point I’m honestly having a hard time even being sure what is wrong, or iterating on a fix: rebuilding the container takes a while, the production logs only show the 422 error, and I can’t figure out where to look for the actual root cause. (It must be somewhere, right? I’m sure I’m just missing it.)

As an update, following the advice in another thread, this command successfully sends email from inside the Docker container:

echo message | s-nail -r "noreply@myforum.com" -s testing -S "smtp=same.email-service.com:25" my@address.com

Which lines up with the email configuration I was using when this problem started. Note that I also performed an upgrade to the latest Discourse via a (required) command line pull on Friday, which makes me wonder if a recent commit brought in this problem.

When is the last time you rebuilt the container?

Also, you did clear the redis queue?

Friday AM I believe. A normal update through the UI triggered the need for a launcher app rebuild. When I examined the sidekiq logs later, it appeared that the backlog started around the time that the container was rebuilt, but it took around 24 hours for the Redis logs to eat up all available storage on the host and actually cause downtime. However, the forum was probably slow leading up to that point, given that sidekiq was desperately trying to resend an increasing number of failed email jobs at 100% CPU usage.

Yes.

However, it worries me that this does not seem to have reduced Redis disk usage. I have a redis_data folder that is currently 29G in size, even after the flush. Perhaps Redis is like MongoDB in that it can be tough to get it to return disk allocations? Given that this is 1/3 of the available disk on the machine, it will become a problem, but I’ll defer that one for now in favor of just getting email working again.

As a debugging note: Is there a way to send a test email from the command line inside the container, using the same codeflow as would be used by Discourse? (Meaning, not from the command line using another tool, which I’ve already verified works.) This would be helpful for debugging, since currently issuing a test email requires fiddling with the web UI and then digging around in the logs to figure out what went wrong. (And so far only finding the 422 errors, and not anything more useful, except in the sidekiq logs which aren’t created when using the test email flow.) Or perhaps the test email UI could surface more debugging information?

Overall I suspect that most people set up Discourse and don’t get to this point without email working, since it is needed to send initial invites and so on. But I’m finding the debuggability limited in the case where email was working and suddenly stops. (Also the retry logic may need some tuning, since it seems awfully fast to retry in this case. A certificate error is probably unlikely to be fixed a few seconds after the initial attempt…)
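
To make the retry point concrete, here is a minimal sketch of a capped exponential backoff schedule. It is purely illustrative: the function name and the 30-second base and one-hour cap are made-up values, and it does not describe what Discourse or sidekiq actually do for failed email jobs.

```python
from typing import List


def retry_delays(attempts: int, base_seconds: float = 30.0,
                 cap_seconds: float = 3600.0) -> List[float]:
    """Delay before each retry: doubles on every attempt, capped at cap_seconds.

    Hypothetical values, chosen only to show the shape of the schedule.
    """
    return [min(base_seconds * (2 ** n), cap_seconds) for n in range(attempts)]


# Eight retries spread over a bit more than two hours instead of a few seconds.
print(retry_delays(8))
```

With a schedule like this, a failure that needs a human to fix (such as a certificate problem) produces a handful of retries per job rather than a tight loop.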

Maybe see Troubleshooting email on a new Discourse install. I think you want:

 rake emails:test[user@domain]

Thanks! This is helpful. Here’s the result:

Testing sending to user@domain using outbound-relays.techservices.illinois.edu:25, username: with plain auth.
======================================== ERROR ========================================
                                    UNEXPECTED ERROR

SSL_connect returned=1 errno=0 state=error: certificate verify failed (Hostname mismatch)

====================================== SOLUTION =======================================
This is not a common error. No recommended solution exists!

Please report the exact error message above to https://meta.discourse.org/
(And a solution, if you find one!)
=======================================================================================

I’m going to rebuild the container now to make sure that it and app.yml are in sync. But overall I’m a bit confused why it says it is using plain auth, since neither a username nor password are provided in the app.yml configuration file.

Is it worth recategorizing this as a bug? I was hesitant initially, since it’s email and there are lots of ways that this could be misbehaving, many of which would be some combination of my fault / external changes. But AFAICT this represents a configuration that was working for several years now and suddenly stopped on an upgrade to the latest edition of discourse_docker. Is it possible that something about how the configuration files are being processed changed recently?

WRT the error message itself—I was able to pull a certificate for that machine and, indeed, the certificate lists another hostname (a different CNAME for the same machine). However, the certificate itself is several years old, and also expired around a year ago, but just started throwing this error recently. So that makes me think that it was not a change to the certificate that is causing the problem.

When I connect to that host and test the STARTTLS I get a certificate that does not match the hostname:

Certificate chain
 0 s:/C=US/ST=California/L=Sunnyvale/O=Proofpoint, Inc./OU=ESP/CN=*.pphosted.com
   i:/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=Thawte RSA CA 2018
 1 s:/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=Thawte RSA CA 2018
   i:/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert Global Root CA

and it has not expired yet:

notBefore=Jun 12 00:00:00 2020 GMT
notAfter=Sep 14 12:00:00 2022 GMT

Doing a forward and reverse lookup shows that the mail servers are actually called mx0a-00007101.pphosted.com and mx0b-00007101.pphosted.com:

outbound-relays.techservices.illinois.edu. 22 IN A 148.163.139.28
outbound-relays.techservices.illinois.edu. 22 IN A 148.163.135.28

28.139.163.148.in-addr.arpa name = mx0b-00007101.pphosted.com.
28.135.163.148.in-addr.arpa name = mx0a-00007101.pphosted.com.

Try changing the hostname you connect to to one of those instead of the .edu name. The cause does not need to have been a change to the certificate; it might have been a change to the hostname or to the code. But the error is correct: there is indeed a hostname certificate mismatch.
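
A rough Python sketch of the check that is failing, using the names from the output above. It is simplified: real TLS clients compare the hostname against the certificate's subjectAltName entries (not shown in this thread) and apply more rules than this, and `hostname_matches` is just an illustrative helper, not part of any library.

```python
def hostname_matches(cert_name: str, hostname: str) -> bool:
    """Return True if hostname is covered by a certificate name.

    Simplified rule: a leading "*" label stands for exactly one label;
    every other label must match exactly (case-insensitively).
    """
    cert_labels = cert_name.lower().rstrip(".").split(".")
    host_labels = hostname.lower().rstrip(".").split(".")
    if len(cert_labels) != len(host_labels):
        return False
    if cert_labels[0] == "*":
        return cert_labels[1:] == host_labels[1:]
    return cert_labels == host_labels


CERT_NAME = "*.pphosted.com"  # CN from the certificate chain above

for host in (
    "outbound-relays.techservices.illinois.edu",  # configured name
    "mx0a-00007101.pphosted.com",                 # reverse-lookup names
    "mx0b-00007101.pphosted.com",
):
    print(host, hostname_matches(CERT_NAME, host))
```

The .edu name fails the comparison while both pphosted.com names pass, which is why connecting by one of those names should get past the verification step.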

Thanks @RGJ! I’ll give that a try.

However, I’m a bit nervous about using those names, given that they could be subject to change in the future and don’t match the hostname that is provided for on-campus use for this purpose. Is there any way to disable this error via app.yml settings or in some other way?

My approach was to get things back working first, then figure out how to make it better.

You should be able to set DISCOURSE_SMTP_OPENSSL_VERIFY_MODE to false, but you said you already tried that.

Yeah, absolutely! That makes sense.

I think I tried setting that value to none, but not to false. I’ll try false.

OK, can confirm that false does not work. Will try none again.

Can also verify that none does not work.

I guess I’m a bit stumped here as to whether this is reasonable behavior. DISCOURSE_SMTP_ENABLE_START_TLS is set to false, which, at least to non-email-experts like myself, makes it confusing that a certificate is playing a role in this failure. If the machine did not have a certificate at all, would this same problem occur? (Obviously I can’t test this.) If not, it seems even more odd.

Anyway, I’ll go with the temporary fix for now, but something about this seems odd to me.

Certainly. I can imagine that if a mail server requires STARTTLS it will override the STARTTLS setting, but DISCOURSE_SMTP_OPENSSL_VERIFY_MODE should still be able to prevent an error.

Is anyone able to repro this?

@Geoffrey_Challen how did you fix it?

Today I updated my forum to 2.9.0.beta4 (c99a6b10fb) and now I have the same error. Discourse cannot send emails:
SSL_connect returned=1 errno=0 state=error: certificate verify failed (Hostname mismatch)

I have not changed the configuration of the VPS and email!

My app.yml:

  DISCOURSE_SMTP_ADDRESS: smtp.mydomain.info
  DISCOURSE_SMTP_PORT: 25
  DISCOURSE_SMTP_USER_NAME: info@mydomain.info
  DISCOURSE_SMTP_PASSWORD: "mypassword"
  DISCOURSE_SMTP_ENABLE_START_TLS: false           # (optional, default true)
  DISCOURSE_SMTP_DOMAIN: mydomain.info             # (required by some providers)
  #DISCOURSE_NOTIFICATION_EMAIL: noreply@discourse.example.com    # (address to send notifications from)

Tried and nothing changes …

Please help: now I can’t send emails and I can’t use TLS. What can I do?

Issue this command and see which hostname the certificate is for:

openssl s_client -connect smtp.mydomain.info:25 -starttls smtp -showcerts 2>&1 | grep "depth=0"

Replace smtp.mydomain.info with the address of your SMTP server, of course.

Then try to see if you can reach the SMTP server using that hostname.
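
If it is hard to pick the name out of the openssl output by eye, a small Python sketch like the one below can extract it. The helper name `subject_common_name` is made up for illustration, and the first sample line is only an assumed shape of a `depth=0` line; the second is copied from the certificate chain posted earlier in this thread.

```python
import re
from typing import Optional


def subject_common_name(line: str) -> Optional[str]:
    """Pull the CN value out of an openssl subject line, if there is one."""
    match = re.search(r"\bCN\s*=\s*([^\s,/]+)", line)
    return match.group(1) if match else None


# openssl prints subjects either comma-separated ("CN = name") or
# slash-separated ("/CN=name"), depending on version and options.
samples = [
    "depth=0 CN = *.example.com",  # assumed shape, placeholder name
    "0 s:/C=US/ST=California/L=Sunnyvale/O=Proofpoint, Inc./OU=ESP/CN=*.pphosted.com",
]
for sample in samples:
    print(subject_common_name(sample))
```

The extracted name is the pattern the SMTP hostname in app.yml has to match for verification to succeed.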

Thanks for your help @RGJ

The hostname is CN = *.aruba.it, so it’s different from mydomain.info, and yes, I can reach the SMTP server using that hostname and telnet.

Everything worked perfectly before ./launcher rebuild app

But… I have DISCOURSE_SMTP_ENABLE_START_TLS: false, so why does it keep looking for the certificate?

You can access the host using a name that matches the certificate. Or you can ask the server administrator to add the host name that you desire to the certificate.

That’s a good question, but you can make its answer moot by following the above advice, or so I think.

Another question, I think, is why did the mail admin break it for you?

Maybe that setting worked before and now it doesn’t. Whether it’s easier to track down that bug or to change the hostname and see if that solves your problem is unclear.

No one made any changes, I’m sure. I just did ./launcher rebuild to install this plugin.

So should I change the hostname of the VPS to something that ends with .aruba.it?

That’s what it sounds like.

It’s possible that there is a regression that’s caused the issue, but I think that you can solve your immediate issue by changing the hostname.
