Discourse SMTP sends "EHLO localhost" instead of domain, breaking Google smtp-relay

Some context here: Emails have stopped sending - end of file error

Roughly a week ago (Jan 13, 2021), emails started failing to send through Google’s smtp-relay.gmail.com server (which is an allowed and intended use for Google Apps users).

Sidekiq reported the failures with EOFErrors:

Jobs::HandledExceptionWrapper: Wrapped EOFError: end of file reached

And /logs reported the failed tasks as well:

Job exception: end of file reached

Backtrace available in the other post.

===================

Investigation revealed that up-to-date Discourse installs are connecting to SMTP relays with ‘EHLO localhost’, and Google started rejecting these roughly a week ago.

From tcpdump on a production instance:

0x0030:  d10f f8e4 4548 4c4f 206c 6f63 616c 686f  ....EHLO.localho
	0x0040:  7374 0d0a                                st..
...
	0x0030:  de62 f0c3 3432 3120 342e 372e 3020 5472  .b..421.4.7.0.Tr
	0x0040:  7920 6167 6169 6e20 6c61 7465 722c 2063  y.again.later,.c
	0x0050:  6c6f 7369 6e67 2063 6f6e 6e65 6374 696f  losing.connectio
	0x0060:  6e2e 2028 4548 4c4f 2920 6a31 3673 6d34  n..(EHLO).j16sm4
	0x0070:  3831 3932 3976 736d 2e31 202d 2067 736d  81929vsm.1.-.gsm
	0x0080:  7470 0d0a                                tp..
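For anyone who wants to check the capture by hand, here is a minimal Ruby sketch that turns the hex column of the first two dump lines back into bytes. The helper name `decode_tcpdump_line` is made up for illustration, and the line layout (offset, hex groups, two or more spaces, ASCII rendering) is assumed from the output above. I'm also assuming the first four bytes on the 0x0030 line precede the SMTP payload (presumably the tail of the TCP header).

```ruby
# Decode the hex column of one `tcpdump -X` style line into raw bytes.
# Assumed layout: "0xOFFSET:  hex groups  ascii rendering".
def decode_tcpdump_line(line)
  # Drop the "0x0030:" offset, then keep only the hex column, which is
  # separated from the ASCII rendering by two or more spaces.
  hex_column = line.strip.sub(/\A0x\h+:\s+/, '').split(/\s{2,}/).first
  [hex_column.delete(' ')].pack('H*')
end

lines = [
  '0x0030:  d10f f8e4 4548 4c4f 206c 6f63 616c 686f  ....EHLO.localho',
  '0x0040:  7374 0d0a                                st..'
]

payload = lines.map { |l| decode_tcpdump_line(l) }.join

# Skip the four leading bytes (assumed not to be part of the SMTP payload).
puts payload.byteslice(4..-1).inspect
```

What is left after the skipped bytes is the EHLO line exactly as it went over the wire, including the trailing CRLF.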

And replicating with telnet gives the same result:

root@conversation:~# telnet smtp-relay.gmail.com 587
Trying 74.125.137.28...
Connected to smtp-relay.gmail.com.
Escape character is '^]'.
220 smtp-relay.gmail.com ESMTP ls8sm507258pjb.6 - gsmtp
ehlo localhost.localdomain
421 4.7.0 Try again later, closing connection. (EHLO) ls8sm507258pjb.6 - gsmtp
Connection closed by foreign host.

However, a domain-specific ehlo works properly:

root@conversation:~# telnet smtp-relay.gmail.com 587
Trying 74.125.137.28...
Connected to smtp-relay.gmail.com.
Escape character is '^]'.
220 smtp-relay.gmail.com ESMTP p10sm668563uaw.3 - gsmtp
ehlo conversation.sevarg.net
250-smtp-relay.gmail.com at your service, [64.227.96.27]
250-SIZE 157286400
250-8BITMIME
250-STARTTLS
250-ENHANCEDSTATUSCODES
250-PIPELINING
250-CHUNKING
250 SMTPUTF8
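The two telnet sessions above can be scripted. This is a rough sketch, not anything Discourse ships: `SmtpReply`, `read_smtp_reply` and `ehlo_probe` are names I made up. It reads one (possibly multi-line) SMTP reply from any IO-like object, and the probe sends a single EHLO after the greeting. If the server has already closed the connection, the read raises EOFError, which may well be the same "end of file reached" that Sidekiq reports.

```ruby
# One SMTP reply: the numeric code plus the text of each reply line.
SmtpReply = Struct.new(:code, :lines) do
  def ok?
    (200..299).cover?(code)
  end
end

# Read a reply from an IO-like object that responds to #gets.
# Continuation lines look like "250-TEXT"; the last line is "250 TEXT".
def read_smtp_reply(io)
  lines = []
  while (raw = io.gets)
    line = raw.chomp
    lines << line[4..-1].to_s
    # Anything other than "-" after the 3-digit code marks the final line.
    return SmtpReply.new(line[0, 3].to_i, lines) unless line[3] == '-'
  end
  raise EOFError, 'connection closed before a complete reply'
end

# Wait for the 220 greeting, send EHLO with the given name, return the reply.
def ehlo_probe(io, helo_name)
  greeting = read_smtp_reply(io)
  raise "unexpected greeting: #{greeting.code}" unless greeting.code == 220
  io.write("EHLO #{helo_name}\r\n")
  read_smtp_reply(io)
end

# Usage against a real server (needs network access, so left commented out):
#   require 'socket'
#   TCPSocket.open('smtp-relay.gmail.com', 587) do |sock|
#     reply = ehlo_probe(sock, 'conversation.sevarg.net')
#     puts reply.code, reply.lines
#   end
```

Taking the socket as a parameter keeps the parsing testable with a StringIO, without touching the network.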

======

Based on the logs, I identified the file to modify to test the fix (in the docker image):

/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/mail-2.7.1/lib/mail/network/delivery_methods/smtp.rb

Changing

    DEFAULTS = {
      :address              => 'localhost',
      :port                 => 25,
      :domain               => 'localhost.localdomain',

to

    DEFAULTS = {
      :address              => 'conversation.sevarg.net',
      :port                 => 25,
      :domain               => 'conversation.sevarg.net',

resolved the issue (after an instance restart). The EHLO is now sent with the domain string, and emails now send properly from my instance.

================

Desired behavior: When sending email, the default Discourse install uses the configured domain name for the initial connection to the SMTP server. Alternately, a configuration option exists to override the domain sent. If this exists, I was unable to find it by searching.

I believe that I have seen this same error from other people (who may not have also been using Google Domains).

A longer-term fix is to add some magic to your app.yml that does that re-write for you. But hopefully a Real Fix will be coming instead.

If there’s a way to fix it with app.yml, I’m certainly interested - hardcoding my domain in the code to have working email is very clearly not a “proper” fix, but it does demonstrate where to resolve the issue more permanently.

Is there a reason it doesn’t simply use the configured domain for the ehlo? That’s going to be “more correct” than localhost.

Great investigative work @Syonyk!

Can you please share your app.yml file SMTP settings?

There’s nothing in there beyond the normal required settings.

  DISCOURSE_SMTP_ADDRESS: smtp-relay.gmail.com
  DISCOURSE_SMTP_PORT: 587
  DISCOURSE_SMTP_USER_NAME: [email username]
  DISCOURSE_SMTP_PASSWORD: [password]

Can you please try adding a new line with

DISCOURSE_SMTP_DOMAIN: conversation.sevarg.net

and try again?

I added that line and rebuilt the app (is there a way around that step?).

I now see “domain” in my email settings, the smtp.rb file is reverted to having localhost as the default, and emails appear to send properly - I’m able to send test messages and they get transmitted properly.

So that resolves things, as far as I can tell. Could this be added to documentation or setup flow somewhere? I looked for a while for such a setting, and couldn’t find this option - even knowing that config option, there’s very little mentioning it.
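On why the gem's default can stay at localhost once a domain is configured, here is a minimal sketch. It assumes the application's settings are merged over the DEFAULTS hash with unset values dropped first. I have not traced the exact code path in Discourse or the mail gem, and the helper name `effective_smtp_settings` is invented for illustration.

```ruby
# Illustrative copy of the defaults quoted earlier in the thread
# (only the three keys shown there).
DEFAULTS = {
  :address => 'localhost',
  :port    => 25,
  :domain  => 'localhost.localdomain'
}.freeze

# Hypothetical helper: drop unset (nil) values, then let whatever
# was configured win over the defaults.
def effective_smtp_settings(configured)
  DEFAULTS.merge(configured.compact)
end

without_domain = effective_smtp_settings(
  :address => 'smtp-relay.gmail.com', :port => 587, :domain => nil
)
with_domain = effective_smtp_settings(
  :address => 'smtp-relay.gmail.com', :port => 587,
  :domain  => 'conversation.sevarg.net'
)

puts without_domain[:domain]
puts with_domain[:domain]
```

Under that assumption, only an unset domain falls through to the default, so setting DISCOURSE_SMTP_DOMAIN is enough and the gem does not need patching.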

It can be added to this block in the default app.yml sample file:

https://github.com/discourse/discourse_docker/blob/master/samples/standalone.yml#L60-L66

Do you consider this helpful?

As long as it’s documented somewhere, I’m happy.

However, if it’s not set, could the code use the value of DISCOURSE_HOSTNAME (which would be correct in almost all simple cases)? Sending ‘localhost’ in EHLO is generally wrong (at least when talking to a server not at localhost).
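To make the proposed fallback concrete, here is a small sketch of the lookup order being suggested. This is not how Discourse currently behaves, and `helo_domain` is a hypothetical helper. Only the two variable names come from this thread.

```ruby
# Pick the name to announce in EHLO from an environment-like object.
# Order: explicit DISCOURSE_SMTP_DOMAIN, then DISCOURSE_HOSTNAME,
# and only then the old 'localhost' behaviour.
def helo_domain(env = ENV)
  %w[DISCOURSE_SMTP_DOMAIN DISCOURSE_HOSTNAME].each do |key|
    value = env[key].to_s.strip
    return value unless value.empty?
  end
  'localhost'
end

# Example with a plain hash standing in for the container environment.
puts helo_domain('DISCOURSE_HOSTNAME' => 'conversation.sevarg.net')
```

An explicit DISCOURSE_SMTP_DOMAIN still wins, so anyone who needs a different name (or really does want localhost) keeps that option.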

I think adding it to standalone.yml and web_only.yml is a good idea.

I agree that it should probably default to DISCOURSE_HOSTNAME. That surely seems better than localhost. Has this changed recently?

The thing here is that it is working for all the tens of thousands of Discourse instances that don’t rely on Google for email at the moment. Changing a default like that could break things for everyone while fixing them only for the comparatively small group of Google Apps users.

I’ll add it to the sample, and we should also have a #howto topic named “Using Google Apps for outgoing email” that documents this. Can anyone take this?

OK. I’ll agree that it probably makes sense to err on the side of caution, but my guess is that any mail server that will accept localhost would also accept DISCOURSE_HOSTNAME, but I don’t have any, like, data. :wink: Having it in the standard templates is probably Good Enough.

Yes, but sending ‘localhost’ (to a remote host) is also wrong, by RFC.

https://tools.ietf.org/html/rfc5321

Emphasis mine.

Older RFCs say that the server should not reject clients based on the EHLO string, which Google seems to be doing, but I don’t see that phrasing in 5321.

I would expect any remote mailserver that tolerates localhost to tolerate (and prefer) a FQDN as required by RFC. I understand the desire not to break things, but as I read the relevant RFCs, Discourse is simply wrong by default, and that it works is a result of excessively permissive remote SMTP servers.
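As a rough illustration of what "a FQDN or an address literal" means in practice, here is a purely syntactic check. It is a sketch only: `plausible_ehlo_name?` is a made-up name, it does no DNS lookup (so it cannot tell whether the name actually resolves), and the label rules are the usual hostname conventions rather than anything quoted from the RFC.

```ruby
# Syntactic sanity check for an EHLO argument. No DNS resolution is done.
def plausible_ehlo_name?(name)
  # Address literal such as [192.0.2.1]
  return true if name.match?(/\A\[[^\[\]\s]+\]\z/)

  labels = name.downcase.chomp('.').split('.')
  return false if labels.size < 2                            # e.g. "localhost"
  return false if labels.last(2) == %w[localhost localdomain]

  # Each label: letters, digits, hyphens; no leading/trailing hyphen; max 63 chars.
  labels.all? { |l| l.match?(/\A[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\z/) }
end

%w[localhost localhost.localdomain conversation.sevarg.net [64.227.96.27]].each do |n|
  puts "#{n}: #{plausible_ehlo_name?(n)}"
end
```

Both names from the thread that Google rejected fail this check, while the site's hostname and a bracketed IP pass.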

I’d be happy to merge a PR to ./discourse-setup that sets it by default to the same value as DISCOURSE_HOSTNAME, provided it’s proven harmless with the most common SMTP services we suggest people use.

I can’t test full end to end mail delivery because I don’t have accounts, but:

Mailgun

% nc smtp.mailgun.com 587
220 Mailgun Influx ready
ehlo conversation.sevarg.net
250-smtp-out-n04.prod.us-west-2.postgun.com
250-AUTH PLAIN LOGIN
...

Sendgrid

% nc smtp.sendgrid.net 587
220 SG ESMTP service ready at ismtpd0021p1las1.sendgrid.net
ehlo conversation.sevarg.net
250-smtp.sendgrid.net
250-8BITMIME
...

Mailjet

% nc smtp.mailjet.com 587
220 in.mailjet.com ESMTP Mailjet
ehlo conversation.sevarg.net
250-smtpin.mailjet.com
250-PIPELINING
...

ElasticMail

… doesn’t respond to a helo or ehlo of any sort, really. o.O Localhost or any real domains.

I think setting it during setup is the right answer, because at least it’s there for people to know about and modify if needed.

There is another related issue: discourse-doctor does not appear to properly set the domain, and will still fail when the actual install can send mail.

With the working config, discourse-doctor still reports the end of file failure.

======================================== ERROR ========================================
                                    UNEXPECTED ERROR

end of file reached

====================================== SOLUTION =======================================
This is not a common error. No recommended solution exists!

Please report the exact error message above to https://meta.discourse.org/
(And a solution, if you find one!)
=======================================================================================

There’s no mention of SMTP_DOMAIN in the test script.

root@conversation:/var/discourse# grep SMTP_DOMAIN discourse-doctor
root@conversation:/var/discourse#

And tcpdump indicates that running discourse-doctor still sends ‘localhost’ in the EHLO. This also needs to be fixed.

	0x0030:  cccd b12c 4548 4c4f 206c 6f63 616c 686f  ...,EHLO.localho
	0x0040:  7374 0d0a                                st..
...
	0x0030:  e247 1aa5 3432 3120 342e 372e 3020 5472  .G..421.4.7.0.Tr
	0x0040:  7920 6167 6169 6e20 6c61 7465 722c 2063  y.again.later,.c
	0x0050:  6c6f 7369 6e67 2063 6f6e 6e65 6374 696f  losing.connectio
	0x0060:  6e2e 2028 4548 4c4f 2920 6e6d 3773 6d31  n..(EHLO).nm7sm1
	0x0070:  3032 3832 3139 706a 622e 3620 2d20 6773  028219pjb.6.-.gs
	0x0080:  6d74 700d 0a                             mtp..

That’s not discourse-doctor but emails.rake:

https://github.com/discourse/discourse/blob/master/lib/tasks/emails.rake#L89

Ah, and it looks like it uses localhost. I guess it should refer to #{ENV["DISCOURSE_HOSTNAME"]}. Oh, or DISCOURSE_SMTP_DOMAIN for now.

@falco are we sure this has always been the case and is not a recent regression?

I remember problems related to sending

ehlo {invalid domain}

from wayyy back when so I would be very surprised if we had been incorrectly sending

ehlo localhost

for a long time?

Sorry, just reporting what I see. I’m pretty sure ‘localhost’ there is wrong too. My web dev is mostly ancient, it’s been over a decade since I’ve worked with stuff professionally. Getting as far as I did took some time, Docker & Ruby & such are all fairly new to me. tcpdump, on the other hand… I know that tool. :wink:

I still think sending ‘localhost’ to a remote server, in any situation, is the wrong behavior, though.

For what it’s worth, we’re experiencing the exact same issue as of Jan 13. The last successful email sent was on Jan 13 and we’re also using smtp-relay.gmail.com — haven’t figured out how to get around this yet (without modifying the source code)