Zero Downtime Upgrades

Hello there,

We are wondering whether zero-downtime upgrades are possible. We use the discourse_docker installation with separate web and data containers, so the issue is mainly the old application code (v2.1) running against a migrated v2.3.0 database schema. This happens between the time we bootstrap the image (which runs the migrations) and the moment the web servers are restarted to run the upgraded image. Are there any recommendations for this?


We successfully performed the upgrade a couple of months ago. I'm posting our experience here for anyone who may have the same question in the future.

Discourse somewhat supports zero-downtime upgrades through post-deployment migrations. The overall process, as we understood it, can be divided into two steps.

STEP 1: Upgrade to the new version and run safe migrations:

  • Update all your plugins and themes to be compatible with the new version. (This can be quite a bit of work, depending on your setup.)
  • Generate your web_only.yml containing:
    version: <NEW_VERSION>
    SKIP_POST_DEPLOYMENT_MIGRATIONS=1
  • Bootstrap (./launcher bootstrap web_only)
  • Restart your servers
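
The two settings that change between STEP 1 and STEP 2 could be generated with something like the following. This is only a sketch in Python, assuming the usual discourse_docker layout where `version` sits under `params` and environment variables sit under `env`. `render_fragment` is a hypothetical helper, and a real web_only.yml contains many more keys (templates, volumes, database settings) that are left out here.

```python
def render_fragment(version: str, skip_post_deploy: bool) -> str:
    """Return only the part of web_only.yml that differs between the two steps.

    Assumption: `version` lives under `params` and the flag under `env`.
    """
    flag = 1 if skip_post_deploy else 0
    return (
        "params:\n"
        f"  version: {version}\n"
        "env:\n"
        f"  SKIP_POST_DEPLOYMENT_MIGRATIONS: {flag}\n"
    )


# STEP 1 variant: post-deployment migrations are skipped.
step_1_fragment = render_fragment("<NEW_VERSION>", skip_post_deploy=True)
# STEP 2 variant: post-deployment migrations are allowed to run.
step_2_fragment = render_fragment("<NEW_VERSION>", skip_post_deploy=False)
print(step_1_fragment)
```

Keeping the version identical in both variants and flipping only the flag makes it harder to accidentally bootstrap a different release in STEP 2.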

STEP 2: Run the post-deployment migrations (high-risk migrations such as column and table drops). By now this step should be safe, since all your servers are running the new version of the code and should no longer depend on the data being dropped:

  • Generate your web_only.yml with:
    version: <NEW_VERSION>
    SKIP_POST_DEPLOYMENT_MIGRATIONS=0
  • Bootstrap (./launcher bootstrap web_only)
  • Restart your servers
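
Why the order of the two steps matters can be shown in miniature. The sketch below uses Python's built-in sqlite3 rather than Postgres, and the table names (`legacy_settings`, `settings`) and the two reader functions are made up for illustration; it is not how Discourse itself is implemented.

```python
import sqlite3


def old_code_read(db):
    # Stand-in for the old release: it still queries the legacy table.
    return db.execute("SELECT value FROM legacy_settings WHERE id = 1").fetchone()[0]


def new_code_read(db):
    # Stand-in for the new release: it only queries the new table.
    return db.execute("SELECT value FROM settings WHERE id = 1").fetchone()[0]


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE legacy_settings (id INTEGER PRIMARY KEY, value TEXT)")
db.execute("INSERT INTO legacy_settings VALUES (1, 'dark')")

# STEP 1 (safe, additive): create and fill the new table, keep the old one.
db.execute("CREATE TABLE settings (id INTEGER PRIMARY KEY, value TEXT)")
db.execute("INSERT INTO settings SELECT id, value FROM legacy_settings")
both_work_after_step_1 = old_code_read(db) == new_code_read(db) == "dark"

# STEP 2 (destructive, post-deployment): drop what only the old code needs.
db.execute("DROP TABLE legacy_settings")
try:
    old_code_read(db)
    old_code_survives_step_2 = True
except sqlite3.OperationalError:  # raised because the table no longer exists
    old_code_survives_step_2 = False

print(both_work_after_step_1, old_code_survives_step_2, new_code_read(db))
```

After the additive step both readers work, so old and new servers can coexist. Only the destructive step breaks the old reader, which is why it is deferred until every server runs the new code.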

Complications:
We decided to run STEP 1 and STEP 2 on different dates, and that caused some issues. The polls model was heavily refactored between 2.1 and 2.3. It seems that initially most of the poll data was stored as a JSON object in the post_custom_fields table. By 2.3 it had been moved to its own polls table. This required a data migration, which was done as part of the post-deployment migrations (STEP 2).

Our particular issue was that right after upgrading to 2.3, polls created before the upgrade were broken, likely because the code that rendered them assumed the new data model. Some users noticed this and tried to fix the polls manually from the UI, and in doing so they created records in the new polls table. Such records, as we sadly discovered later, triggered a Postgres unique constraint violation during the bootstrap process, effectively killing it.
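
The collision can be reproduced in miniature. The following is a simplified sketch using Python's built-in sqlite3 instead of Postgres. The column layout, the `(post_id, name)` unique constraint and the `migrate_polls` function are assumptions made for illustration and are not copied from the actual Discourse migration.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE post_custom_fields (post_id INTEGER, name TEXT, value TEXT)")
# Assumed constraint: at most one poll with a given name per post.
db.execute("CREATE TABLE polls (post_id INTEGER, name TEXT, UNIQUE (post_id, name))")

# A poll created before the upgrade, stored as JSON in the legacy table.
legacy_value = json.dumps({"poll": {"options": ["a", "b"]}})
db.execute("INSERT INTO post_custom_fields VALUES (?, ?, ?)", (1, "polls", legacy_value))

# Between STEP 1 and STEP 2 a user re-saves the poll from the UI,
# which writes a row straight into the new table.
db.execute("INSERT INTO polls VALUES (1, 'poll')")


def migrate_polls(db):
    """Naive data migration: copy every legacy poll into the new table."""
    rows = db.execute(
        "SELECT post_id, value FROM post_custom_fields WHERE name = 'polls'"
    ).fetchall()
    for post_id, value in rows:
        for poll_name in json.loads(value):  # iterates over the poll names
            db.execute(
                "INSERT INTO polls (post_id, name) VALUES (?, ?)", (post_id, poll_name)
            )


try:
    migrate_polls(db)
    migration_failed = False
except sqlite3.IntegrityError:  # the row written from the UI already exists
    migration_failed = True

print("migration failed:", migration_failed)
```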

To work around this we wrote a patch that skips a particular poll migration if the poll already exists in the db. This was not a perfect solution, since data in post_custom_fields that never gets migrated for these posts may be lost. But in our case it was still a good tradeoff: the number of cases was very low across our system (2 instances that we could observe), and it allowed us to run the bootstrap process. Now, testing and applying the patch brought up two more questions:

  • How do we test the change before we put up the PR?
    We wanted to test our code under the same conditions it would run under in reality, which meant testing it against the bootstrap process. This is not as trivial as testing runtime code changes, which you can do by running the standalone Discourse app locally.

  • How do we incorporate the changes into our system?
    We did not want to wait until the patch reached an official Discourse release, which can be a lengthy process and would basically force us to do another upgrade. We already had customers complaining about existing polls, and the longer we waited, the higher the chances of customers manually modifying polls and aggravating the data integrity issues.
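
For reference, the idea behind the patch (skip a poll if it already exists in the new table) can be sketched as follows. This is an illustration in Python with sqlite3, not the actual patch, which targets the Discourse migration code and Postgres. The schema and the function name `migrate_polls_skipping_existing` are hypothetical.

```python
import json
import sqlite3


def migrate_polls_skipping_existing(db):
    """Copy legacy polls into `polls`, skipping any that already exist there."""
    migrated, skipped = 0, 0
    rows = db.execute(
        "SELECT post_id, value FROM post_custom_fields WHERE name = 'polls'"
    ).fetchall()
    for post_id, value in rows:
        for poll_name in json.loads(value):
            exists = db.execute(
                "SELECT 1 FROM polls WHERE post_id = ? AND name = ?",
                (post_id, poll_name),
            ).fetchone()
            if exists:
                # Tradeoff: the legacy data for this poll is not merged.
                skipped += 1
                continue
            db.execute(
                "INSERT INTO polls (post_id, name) VALUES (?, ?)", (post_id, poll_name)
            )
            migrated += 1
    return migrated, skipped


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE post_custom_fields (post_id INTEGER, name TEXT, value TEXT)")
db.execute("CREATE TABLE polls (post_id INTEGER, name TEXT, UNIQUE (post_id, name))")
db.executemany(
    "INSERT INTO post_custom_fields VALUES (?, 'polls', ?)",
    [(1, json.dumps({"poll": {}})), (2, json.dumps({"poll": {}}))],
)
# Post 1 was already re-saved from the UI, so its poll exists in the new table.
db.execute("INSERT INTO polls VALUES (1, 'poll')")

result = migrate_polls_skipping_existing(db)
print(result)
```

Counting the skipped rows gives a quick way to check how many posts may need a manual look afterwards.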

So we found a way to test the changes we were introducing by changing the origin of the repo that the bootstrap script uses. This took some trial and error, since we are not very knowledgeable about pups and the other tooling around this process. In the end, we did something like this.

This allowed us to verify that our fix worked as intended. We were also able to bootstrap a new image in production from our own modified version of Discourse containing the patch, without having to wait for an official Discourse release.
