Let's Encrypt cert renewals (suddenly) failing

Sometime back – it’s not clear how long but at least several months – Let’s Encrypt renewals started failing on my Discourse forum, after running fine for years. When I initially noted this some days ago, the cert had expired in August 2021. After trying some manual renewals and nginx restarts, I found the cert bumped up to expiring just a few days ago. Still not a current cert, obviously. Manually running acme.sh to force a renewal (within the discourse container) is yielding this error (where [site] is my site address, of course):

[site]:Verify error:Fetching http://[site]/.well-known/acme-challenge/[long alpha challenge string]: Error getting validation data
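
For reference, the forced renewal I'm running inside the container looks roughly like this (the acme.sh location and options follow what the standard Discourse setup appears to use, so treat the exact paths as assumptions):

cd /var/discourse
./launcher enter app
# inside the container: force a renewal of all configured certs, with verbose output
LE_WORKING_DIR="/shared/letsencrypt" /shared/letsencrypt/acme.sh --renew-all --force --debug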

I should note that the site requires login for all user access, but this has not been a problem for SSL cert renewals during the years of prior operation.

Any ideas? Thanks very much!

UPDATE: Testing the verification URL using wget returns a 404. However, I do not know where this path is configured in the nginx running inside the Discourse container, or how it relates to the nginx that proxies from outside the container.
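
Roughly what I ran to test (the domain and token are placeholders; the insecure flags just get past the expired cert on the redirect):

# from a machine outside the container
wget --no-check-certificate -S -O - http://[site]/.well-known/acme-challenge/[challenge-string]
# or the curl equivalent, following the redirect and ignoring the expired cert
curl -ikL http://[site]/.well-known/acme-challenge/[challenge-string]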

2 Likes

If it’s from a few months back, is it anything to do with this:

2 Likes

Hi. That shouldn’t relate, because that issue would cause certs to be rejected by the browser with different errors, not expired errors as in my case. It appears that Let’s Encrypt suddenly can’t authenticate with Discourse to deliver new certs. Thanks.

2 Likes

Not if the first expiration was in August. It should have been renewing after that.

4 Likes

For an app not updated after June, there could have been this issue though: Letsencrypt certificate failure to renew - #11 by pfaffman

Not sure if it's what you're looking for, but: discourse_docker/templates/web.letsencrypt.ssl.template.yml at main · discourse/discourse_docker · GitHub

2 Likes

Hi. These don't appear to apply. I'm seeing a 404 error, not those other errors; the builds have been updated all along; and that template from GitHub is indeed the version already installed in my installation. Thanks!

3 Likes

Are you using Cloudflare with the orange cloud, or some other reverse proxy?

2 Likes

Negative. Locally hosted on Ubuntu 18.04 using the default Docker install.

3 Likes

Manually running the (in-container) cron job for cert renewal, the failure is always the same. The attempt to get:

http://[site]/.well-known/acme-challenge/[challenge-string]

fails with “Error getting validation data.”

2 Likes

Not being familiar with the process, could it be expecting the container to be in a state that doesn't hold when running that script alone? E.g. perhaps it expects another cron job to run first which prepares nginx to allow access to such a URL.

Have you tried doing a rebuild? (Which will attempt to obtain a new certificate in the process.)

You mention it’s hosted locally. Are you able to access the instance from outside your network using the domain name?

2 Likes

Hi, yes, multiple rebuilds. No change. I use Let's Encrypt on a bunch of non-Discourse sites and they all renew just fine. Yes, I can access it from an external site, and I've tested using wget; the result is a 404. Question: exactly where does the nginx html tree live in this case, the part that would contain (or should contain) the .well-known directory? I have been unable to find it. Thanks.
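
In case it's useful, this is roughly how I've been hunting for it inside the container so far, with no luck:

cd /var/discourse && ./launcher enter app
# search the nginx and runit configs for whatever serves the challenge path
grep -R -n "well-known" /etc/nginx /etc/runit 2>/dev/null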

2 Likes

I couldn’t find a cron job, just a runlevel script at /etc/runit/1.d/letsencrypt. It looks like that script starts a new instance of nginx with a config which includes this:

location ~ /.well-known {
  root /var/www/discourse/public;
  allow all;
}

I think that means the path would end up being /var/www/discourse/public/.well-known/acme-challenge, though it may well be created just before the challenge, then removed afterwards.
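
If you want to test that guess while that nginx config is active, something along these lines should do it (the paths come straight from the snippet above, so treat them as an assumption):

# inside the container: drop a test file where the root directive should map the challenge path
mkdir -p /var/www/discourse/public/.well-known/acme-challenge
echo ok > /var/www/discourse/public/.well-known/acme-challenge/test
# then, from outside, see whether it comes back
curl -i http://[site]/.well-known/acme-challenge/test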

If that’s the script you’ve tried running manually, did you stop nginx first? The instance the script tries to start will try to listen on port 80 so I suspect that would fail if nginx is already running for Discourse.
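
If you do try it by hand, the rough sequence inside the container would presumably be something like this (I'm going from the runit layout, so the exact service commands are a guess):

# stop the Discourse-facing nginx so the script's own nginx can bind port 80
sv stop nginx
# run the letsencrypt runlevel script on its own
/etc/runit/1.d/letsencrypt
# bring the normal nginx back afterwards
sv start nginx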

2 Likes

I think I may see the problem, but I don't know how to fix it. It appears that all attempts to access the forum over http on port 80 are (as expected) being redirected to https on port 443. Right. But this means that when Let's Encrypt attempts to validate for the renewal, it fails, because the current certificate has expired. I can see the redirect with wget. So the question is: how do I disable the redirect temporarily so that Let's Encrypt can validate and get me a new, non-expired cert? An additional possible complication is that the redirect is a 301 (permanent). Thanks.
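
For what it's worth, this is roughly how I'm observing the redirect (placeholder path):

# check the response for the challenge path without following the redirect
curl -I http://[site]/.well-known/acme-challenge/test
# the response comes back as a 301 with a Location: header pointing at the https URL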

2 Likes

This redirect is in /etc/nginx/conf.d/discourse.conf and will not be used when nginx is stopped, then started with the config mentioned in my previous post.

I’m afraid I’m not very familiar with how the auto-upgrade works so I’m not sure what the appropriate method would be to renew while the container is running. In theory, just stopping and starting the container should result in it renewing but since you said a rebuild didn’t do it, that probably won’t either.

acme.sh has options like --renew-all but I’m not sure what other options are needed for it to do the right thing here. The following might be all you need but I can’t say for certain.

LE_WORKING_DIR="/shared/letsencrypt" /root/acme.sh/acme.sh --renew-all
2 Likes

And indeed this does permit Let’s Encrypt to get in without the redirect, but apparently the file it’s looking for does not exist, so ultimately the same verification failure.

2 Likes

I have the same problem. Has anyone worked out a clear procedure to correct the problem?

2 Likes

I'm now using this to try to get the cert. It appears that the validation token IS being retrieved by curl, but acme.sh is STILL declaring a validation failure every time! So still down.

"/shared/letsencrypt"/acme.sh --renew-all --force --insecure --home "/shared/letsencrypt" --debug

2 Likes

Hi @L30110 :slightly_smiling_face:

I’m one of the regulars over in the Let’s Encrypt community. I was sent by @JimPas to take a look into this thread, which I will do as soon as I return from lunch.

3 Likes

With many ACME clients (like acme.sh) when nginx is specified as the authentication method, the http-01 challenge file is created in a specific directory based on an exception/redirection in the nginx server configuration rather than directly in the .well-known/acme-challenge directory structure in the webroot directory. Often this redirection only temporarily exists for the duration of the challenge verification, as do the challenge files themselves.

Hence:


A wise consideration. A properly written renewal script should make it unnecessary to stop nginx. Typically, nginx is used to serve the challenge file(s), then something resembling nginx -s reload is used to gracefully reload the web server/proxy once the new certificate is acquired.
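
As a rough illustration only (this is not necessarily the exact invocation Discourse's template uses), the shape of such a renewal with acme.sh would be:

# issue/renew against the live webroot, then reload nginx once the cert is in place
acme.sh --issue -d [site] -w /var/www/discourse/public --reloadcmd "nginx -s reload"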


Nope. :wink:

Per Challenge Types - Let's Encrypt:

Our implementation of the HTTP-01 challenge follows redirects, up to 10 redirects deep. It only accepts redirects to “http:” or “https:”, and only to ports 80 or 443. It does not accept redirects to IP addresses. When redirected to an HTTPS URL, it does not validate certificates (since this challenge is intended to bootstrap valid certificates, it may encounter self-signed or expired certificates along the way).


Typically when we see issues like this, one of these is usually the culprit:

  • A firewall is not allowing traffic through to the web server/proxy that’s serving the challenge file(s)
  • A router/proxy is improperly mapped/configured such that the challenge verification request from Boulder (the Let’s Encrypt CA server) attempts to retrieve the file(s) from an incorrect web server or directory.
  • Some type of rewriting/redirection (e.g. .htaccess files in Apache) is interfering with the web server/proxy being able to serve the challenge files from the correct location.
  • Usage of non-standard ports, usually with improper mapping.
  • The container running the ACME client is creating the challenge file(s) in a place where the web server/proxy (e.g. nginx) does not serve them. When Docker is involved, this is almost always the problem (see the sketch after this list).
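
A quick way to check for that last case, roughly (the paths are assumptions about a standard Discourse container):

# inside the container, while a forced renewal runs in another shell,
# watch whether the token file ever appears in the directory nginx is configured to serve
ls -la /var/www/discourse/public/.well-known/acme-challenge/
# and check what webroot/working directory acme.sh has recorded for the domain
grep -R -n -i "webroot" /shared/letsencrypt 2>/dev/null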
2 Likes

Hi. So, of those various items you listed, several clearly do not apply in my case. Not a firewall issue – I can manually access the token with wget or curl from all of: 1) inside the Docker Discourse app, 2) outside the Docker container on the host system, and 3) an unrelated system.

For these manual cases, I DO get the token contents back from the expected location, assuming that --ignore or -k is specified to get past the expired cert when Discourse redirects to https automatically.

I have not changed any aspect of the nginx configuration created by Discourse, either inside or outside the Discourse Docker container. I don't run any other copies of nginx, and Apache lives on completely different ports for local use only. Note that all of this had been working fine for over two years, with routine cert renewals and no other app changes – it's a very stable box.

No unusual ports.

Since I can get the token contents manually, I don’t see how wrong locations could be involved. EXCEPT …

I was not manually stopping nginx for my tests. I’ve now done so, and there was no significant difference – same errors from acme.sh (currently error 56 again). When nginx is stopped from inside the container, I do see a runsv nginx instance on the host, but it has no worker or cache processes. When I restart nginx in the container, the worker and cache procs reappear on the host along with the runsv nginx that had remained. The sv start/stop nginx commands inside the container give the expected confirmations of those actions.

But there’s something else mentioned above that may be of concern. And I don’t understand why this would suddenly be a problem given how long things were working up to now.

The static ip address of the forum that is used from outside my local networks is not usable by that machine for connecting to that machine’s own services, due to complexities in the way that the static IPs are provided by the ISP. I’ve routinely used entries in /etc/hosts to provide local network ip addresses for those names. So, when I test with curl on the same machine (inside or outside the container, they both have the /etc/hosts addition for the forum), the test uses a different (and local) ip address than would be used by an external site looking it up via DNS. Is there any way that this might be relevant? Thanks.
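
I suppose one way to approximate what the outside world sees – bypassing the /etc/hosts override – would be to pin the name to the public IP explicitly (domain and IP are placeholders, and I'm not sure this test is even valid here):

# force curl to use the public static IP for the name instead of the /etc/hosts entry
curl --resolve [site]:80:[public-ip] -i http://[site]/.well-known/acme-challenge/test
curl --resolve [site]:443:[public-ip] -ik https://[site]/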

2 Likes