Old Let's Encrypt netcat setup failing because it responds to the first request


(Jeff Atwood) #1

Documenting this because I seem to run into it over and over… :angry:

:warning: This conversation is about an old version

The old version of our Let’s Encrypt setup was failing sporadically because it was using netcat to listen for the “next” HTTP request after asking for cert verification from Let’s Encrypt. On a busy live Discourse site that request was highly unlikely to be from the Let’s Encrypt site for the cert verification.

You get a zero-length certificate back after following this howto to the letter:

-rw-r--r-- 1 root root  424 Aug  9 04:43 dhparams.pem
-rw-r--r-- 1 root root    0 Aug  9 04:44 forum.example.com.cer
-rw-r--r-- 1 root root 3243 Aug  9 04:44 forum.example.com.key
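
A quick way to spot this symptom is to look for `.cer` files of size zero, as in the listing above. The following is only a minimal sketch: the function name is hypothetical, and the certificate directory is passed in as a parameter because its location depends on your install.

```python
from pathlib import Path


def find_empty_certs(cert_dir):
    """Return the names of zero-byte .cer files directly inside cert_dir."""
    return sorted(
        p.name
        for p in Path(cert_dir).glob("*.cer")
        if p.is_file() and p.stat().st_size == 0
    )
```

Calling `find_empty_certs(...)` on the directory that holds the files listed above would return `forum.example.com.cer` for that listing, since it is the only `.cer` file and its size is 0.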

When you delete these files and rebuild, same problem. If you try to reissue the cert manually you get:

[Tue Aug  9 08:46:04 UTC 2016] url='https://acme-v01.api.letsencrypt.org/acme/challenge/25oDC8PA6GMq-6N_RWRbcNCEYhEK3FlDBreQzgm0YRo/225875801'
[Tue Aug  9 08:46:04 UTC 2016] _CURL='curl -L --silent --dump-header /shared/letsencrypt/http.header '
[Tue Aug  9 08:46:04 UTC 2016] _ret='0'
[Tue Aug  9 08:46:04 UTC 2016] code='202'
[Tue Aug  9 08:46:04 UTC 2016] sleep 5 secs to verify
[Tue Aug  9 08:46:09 UTC 2016] checking
[Tue Aug  9 08:46:09 UTC 2016] GET
[Tue Aug  9 08:46:09 UTC 2016] url='https://acme-v01.api.letsencrypt.org/acme/challenge/25oDC8PA6GMq-6N_RWRbcNCEYhEK3FlDBreQzgm0YRo/225875801'
[Tue Aug  9 08:46:09 UTC 2016] CURL='curl -L --silent'
[Tue Aug  9 08:46:10 UTC 2016] ret='0'
[Tue Aug  9 08:46:10 UTC 2016] forum.example.com:Verify error:Could not connect to http://forum.example.com/.well-known/acme-challenge/d7EU_9WX0MZCMXvXmNN8b-_OWbaT4XjQdeIKCJISo6M
[Tue Aug  9 08:46:10 UTC 2016] GET
[Tue Aug  9 08:46:10 UTC 2016] url='http://forum.example.com/.well-known/acme-challenge/d7EU_9WX0MZCMXvXmNN8b-_OWbaT4XjQdeIKCJISo6M'
[Tue Aug  9 08:46:10 UTC 2016] CURL='curl -L --silent'
[Tue Aug  9 08:46:10 UTC 2016] ret='56'
[Tue Aug  9 08:46:10 UTC 2016] Skip for removelevel:
[Tue Aug  9 08:46:10 UTC 2016] pid='16542'
[Tue Aug  9 08:46:10 UTC 2016] GET
[Tue Aug  9 08:46:10 UTC 2016] url='http://localhost:80'
[Tue Aug  9 08:46:10 UTC 2016] CURL='curl -L --silent'
[Tue Aug  9 08:46:10 UTC 2016] ret='7'
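
To make a log like this easier to read, each `url=` line can be paired with the `ret=` (curl exit status) that follows it. This is a rough sketch that assumes the log format matches the lines above; the regular expression and the names `LINE_RE`, `CURL_EXIT_CODES` and `summarize_requests` are hypothetical. The exit code descriptions are paraphrased from the curl man page and cover only the codes seen here.

```python
import re

# Matches lines such as: [Tue Aug  9 08:46:10 UTC 2016] ret='56'
# The optional underscore covers the _ret='0' variant seen in the log.
LINE_RE = re.compile(r"^\[[^\]]*\]\s+_?(url|ret)='([^']*)'\s*$")

# Only the curl exit codes that appear in the log above.
CURL_EXIT_CODES = {
    0: "success",
    7: "failed to connect to host",
    56: "failure receiving network data",
}


def summarize_requests(log_text):
    """Pair each url='...' line with the next ret='...' line that follows it."""
    results = []
    current_url = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip CURL=, code=, pid= and free-text lines
        key, value = match.groups()
        if key == "url":
            current_url = value
        elif current_url is not None:
            code = int(value)
            results.append((current_url, code, CURL_EXIT_CODES.get(code, "unknown")))
            current_url = None
    return results
```

Read this way, the two requests to the Let's Encrypt API succeed (exit 0), while the fetch of the challenge URL on the forum fails while receiving data (exit 56) and the request to `http://localhost:80` cannot connect at all (exit 7).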

The root issue seems to be the fact that it can’t connect …

forum.example.com:Verify error:Could not connect to http://forum.example.com/.well-known/acme-challenge/d7EU_9WX0MZCMXvXmNN8b-_OWbaT4XjQdeIKCJISo6M

… but I have no idea why that’d be happening, as this is a stock Discourse install on DigitalOcean; there’s nothing special about it.


(Andrew Bereza) #2

This exact same thing is happening to me too (also DigitalOcean). My forum isn’t loading at all… Any fix?


(Jeff Atwood) #3

It seems Let’s Encrypt might not be working at the moment, probably due to the acme.sh script we use being broken. This is about the third time this has happened, so I suspect it will keep happening indefinitely at this rate. We should consider alternatives @tgxworld

The workaround is to buy a traditional SSL cert and follow our other #howto on that.


(@SenpaiMass) #4

What about traditional certbot, can that not be used with discourse?


(Matt Palmer) #5

The official certbot client is very heavyweight, requiring a lot of Python stuff. As a Ruby-based project, with a large image already, stuffing all that extra Python gubbins in as well would make a fat image even fatter. Using certbot isn’t an impossibility, just a “we’d really rather not if we can manage it”.


(Daniel Gagnon) #6

certbot could be in an optional template just like your current solution is. It might take a bit of extra space but as far as resources consumed it’s pretty lightweight.

If you offered it as an alternative configuration I might choose it.


(Alan Tan) #7

Just to clear things up a little, the current problems will not be fully resolved just by switching clients. The main problem is that we only issue the cert when the container is created instead of during bootstrap. So if anything goes wrong and we end up with a bogus cert, nginx won’t be able to start and the site goes down.

Can we do it during bootstrap instead? Yes, that was how the template worked in the first iteration, but the user was required to publish port 80 manually by passing the right docker commands to ./launcher. We felt that the extra step would complicate things for users and decided to hide all the internal details by moving the commands to a script that runs when the container starts.

However, that approach doesn’t seem to be working out well now and I’m inclined to add an extra command to launcher that issues the cert. Something like ./launcher setup_ssl app


(Brahn) #8

Wouldn’t you still have to mess around inside the container to automate renewals?


(Alan Tan) #9

Yup but that can be done during bootstrap as well.


(Matt Palmer) #10

Does there come a point at which we just say, “stuff it” and build a separate container for SSL termination, containing all the magicks for LE issuance and renewal and an nginx config that just listens on 443, and forwards to the port-80-only Discourse container? It’d at least remove the problem of a lack of SSL issuance causing everything to halt and catch fire…


(Jeff Atwood) #11

I think that screws up a lot of the work @sam did to get http/2 working.


(Matt Palmer) #12

It’d actually be closer to how we’re supporting H/2 internally, so I’d be surprised if that turned out to be the blocker on such a plan.


(Sam Saffron) #13

It’s definitely cleaner to pile on an extra container here, cause port 80 keeps on working even if SSL is somehow messed up.

You no longer have to do a full rebuild to clean up SSL issues and so on.

One tricky bit though is … how do you handle the port 80 redirect to port 443?


(Jeff Atwood) #14

A whole separate container on a single container instance seems to defeat much of the simplicity of the approach.


(Sam Saffron) #15

What I do on https://samsaffron.com is run the official LE container to renew the certs and communicate via shell to restart NGINX in the Discourse container.

Initially I started with a dummy SSL cert so NGINX boots, then the LE container just runs on a schedule.

It allows me to use the official LE image which is more robust.

But the setup itself is far more involved.


(Jay Pfaffman) #16

FWIW, my paid-for cert expired a day or two ago, and I just added the letsencrypt lines to my app.yml file, did a ./launcher rebuild app and it worked just fine.


(Alan Tan) #17

Hmm, looks like the number of problems has increased. I’ll look into a better setup next week.


(Alan Tan) #18

I had a look at the source code and realized that the standalone server uses netcat under the hood, and it will only serve a single connection before exiting. For sites with high traffic, it becomes a game of chance whether the validation request will be the one connection that netcat serves.
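
To get a feel for the odds, here is a back-of-the-envelope sketch. It assumes other traffic arrives as a Poisson process at a steady rate, which is a modelling assumption and not something measured in this thread; the function name and both parameters are hypothetical. Under that assumption, the chance that no other request reaches the listener before the validation request is `exp(-rate * delay)`.

```python
import math


def chance_validation_is_first(requests_per_second, seconds_until_validation):
    """Probability that no other request arrives before the validation request.

    Assumes other traffic is a Poisson process with the given average rate;
    this is an illustrative model only.
    """
    if requests_per_second < 0 or seconds_until_validation < 0:
        raise ValueError("rate and delay must be non-negative")
    return math.exp(-requests_per_second * seconds_until_validation)
```

With no other traffic the result is 1, and it falls off exponentially as either the request rate or the wait for the validation request grows, which fits a quiet site working fine while a busy one fails.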

cc @Neilpang Do you know if this problem has been reported before?

To resolve the problem, I’ve decided to use nginx during the container’s setup process to serve the validation request if the cert is up for renewal.
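
The difference from a single-connection listener can be illustrated with a small sketch. This is not the nginx configuration the template uses; it only shows the idea of a server that keeps listening, answers requests under `/.well-known/acme-challenge/`, and returns 404 for everything else. The names `respond` and `make_server` and the token-to-body mapping are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

CHALLENGE_PREFIX = "/.well-known/acme-challenge/"


def respond(path, challenges):
    """Return (status, body) for a request path.

    challenges maps a challenge token to the response body expected for it.
    """
    if path.startswith(CHALLENGE_PREFIX):
        token = path[len(CHALLENGE_PREFIX):]
        if token in challenges:
            return 200, challenges[token].encode("ascii")
    return 404, b""


def make_server(port, challenges):
    """Build a server that keeps answering requests until it is shut down."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = respond(self.path, challenges)
            self.send_response(status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return HTTPServer(("", port), Handler)
```

Calling `make_server(80, {...}).serve_forever()` would keep serving, so an unrelated visitor hitting the site first gets a 404 and the validation request is still answered afterwards.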


(Jeff Atwood) #19

Can the connection be triggered to only serve when the incoming connection is from Let’s Encrypt, e.g. ignore any other requests?


(Alan Tan) #20

Unlikely with netcat, it listens for a single connection and exits.