Self-hosted upgrade to 3.1.0.beta2 with typical multi-container install requires extra downtime

I updated my sites from 3.1.0.beta1 to 3.1.0.beta2. After bootstrapping the new version, but before destroying the old app containers and starting new ones, at least one of those sites started showing the generic error page to users.

I didn’t notice it on my test site or the other sites I run, but it’s possible that it happened and I didn’t see it.

In any case, for at least one of my sites, the “zero downtime” update process did not succeed.

9 posts were split to a new topic: Problems with self-hosted upgrade to 3.x: cannot roll back

A post was merged into an existing topic: Problems with self-hosted upgrade to 3.x: cannot roll back

I’d like to repeat that I was not using the GUI updater. I have a multi-container install. I did:

git pull
./launcher bootstrap app
./launcher destroy app && ./launcher start app
./launcher cleanup

(I use app for the web app even for multi-container installs. I know it’s not normal practice. I hate typing web_only)
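For anyone following along with a standard multi-container install, the equivalent sequence with the conventional container name would be:

git pull
./launcher bootstrap web_only
./launcher destroy web_only && ./launcher start web_only
./launcher cleanup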

Sometime after I started the bootstrap and before I destroyed the app, the old version running against the new database showed only an error screen. I don’t remember the contents, and I didn’t want to create a longer outage by stopping to take a screenshot before doing the destroy/start, but it was only text on white and was not the system maintenance page. I have seen this only a few times before: when the bootstrap runs db:migrate as part of the asynchronous “zero-downtime” rebuild, the old software still running can fail due to a schema inconsistency.

What I saw was whatever happens in the case of a database inconsistency. That’s far better than blissfully soldiering on and breaking the database! When I posted, it was to warn that this was one of those rare cases where applying a point update (here, from 3.1.0.beta1 to 3.1.0.beta2) creates a schema incompatibility between the 3.1.0.beta1 code and the database once the 3.1.0.beta2 db:migrate has run, as happens only occasionally with the normal low-downtime updates in a multi-container deployment.

My experience is different from the error that has been reported with Ruby in the GUI updater; it’s a completely unrelated problem. I recognize that my post was moved out of the announcement into a general “problems with” thread, but I want to be clear that I posted in the announcement to warn other self-hosters like me, when they saw the announcement, that this particular update was one that could have this impact.

My message was not complaining about a bug, or even a problem. It was intended only as a notice of a normal but infrequent case associated with this particular release and not called out in the release notes.

The complaints about the docker manager not recognizing that it can’t update from within the image are completely unrelated to my attempt to provide a helpful notification to other self-hosting admins.

It would make a lot of sense to separate these unrelated issues into independent threads for independent problems. EDIT by @supermathie: Done

1 Like

Are you doing a two-stage migration, as described in Introducing Post Deployment Migration?

This pattern is critical if you’re doing e.g. a blue/green deployment and need the previous version to continue to function.
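Roughly, the two-stage flow looks like this; this is a sketch only, assuming the usual web_only container definition and the standard /var/www/discourse path inside the container, and the linked topic is the authoritative reference:

# stage 1: bootstrap and swap containers with post-deployment migrations deferred
# (requires SKIP_POST_DEPLOYMENT_MIGRATIONS: 1 in the env: section of web_only.yml)
./launcher bootstrap web_only
./launcher destroy web_only && ./launcher start web_only

# stage 2: once the new code is serving traffic, run the deferred migrations
./launcher enter web_only
cd /var/www/discourse
su discourse -c 'SKIP_POST_DEPLOYMENT_MIGRATIONS=0 bundle exec rake db:migrate'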

1 Like

I think that answers the question. The launcher script has no support for SKIP_POST_DEPLOYMENT_MIGRATIONS.

Again, I am not reporting a bug. I’m just trying to warn others who run the standard multi-container install, using the normal documented launcher process, that this update is different from their typical experience.

Really truly, honestly, I mean it, this is not a bug report!

If I want blue/green deployment with launcher, I should provide a PR for launcher to implement it. :relaxed:

1 Like

I did not come up with the “problem” in the topic title; that was done when my comment was moved out of the announcement thread. I have now modified the title to make it clear, I hope, that I’m not complaining about a problem. :relaxed:

1 Like

All good!

I suspect that there are a very small number of users doing multi-container blue/green, but we would welcome suggestions on how to do that.

And also suggestions on how to make the topic I linked easier to find; I suspect it’s not easy to find if one doesn’t already know it exists.

2 Likes

I had seen the SKIP_POST_DEPLOYMENT_MIGRATIONS doc. What I had really missed was this post that shows how to do zero downtime deployments with launcher:

So now I have to think about that, now that I know that it is feasible. If I do it, I’ll update MKJ's Opinionated Discourse Deployment Configuration with what I do.
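I haven’t worked out the details yet, but one possible shape of a blue/green arrangement with launcher would be two web container definitions switched behind an external reverse proxy. The names web_a/web_b here are hypothetical, and the linked post may well do it differently:

# containers/web_a.yml is live, containers/web_b.yml carries the new version,
# each listening on its own port or socket behind nginx/haproxy
./launcher bootstrap web_b   # build the new version; the old container keeps serving
./launcher start web_b       # bring the new container up alongside the old one
# ...repoint the reverse proxy upstream from web_a to web_b, then...
./launcher stop web_a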

I’ve been having lots of trouble getting worked up about it when I am providing four (some months four and a half) nines of availability on a service that I run for free in my spare time. It’s a testament to the quality of Discourse development that I can do that on a tests-passed policy, including things like the extra minute or so of downtime I saw this time, and sometimes rebooting the host for security updates.

3 Likes

The ansible script that dashboard.literatecomputing.com uses runs a rake task after the new container is launched to do the post migrations. It counts on having SKIP_POST_DEPLOYMENT_MIGRATIONS turned on in the web_only.yml. I do this only on sites that I know will be managed by my scripts, since if you don’t understand how it works it’s something of a time bomb.

Note that for many upgrades, bootstrapping the new container won’t break things for the running container, but for some it does. It’s not that uncommon for an upgrade to migrate the database in a way that the old container can’t use (without SKIP_POST_DEPLOYMENT_MIGRATIONS).
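For reference, “turned on in the web_only.yml” is just an entry in the env: section of the container definition (a sketch; the rest of the file is unchanged):

env:
  # defer post-deployment migrations; something must run them after each deploy
  SKIP_POST_DEPLOYMENT_MIGRATIONS: 1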

2 Likes