How to migrate quickly to separate web and data containers

:warning: Warning: If you are not comfortable working as a Linux systems administrator, and do not have experience with docker containers, moving to a multi-container deployment will cause you difficulty, and both staff and volunteer help here will appropriately ask you to return to a standalone single-container deployment fully managed by the launcher script.

If you move to a multi-container deployment and your system breaks as a result, you are likely to experience the opportunity to keep both broken pieces. If you read the instructions below and it feels like magic, rather than clarifying how things actually work inside the containers, run, don’t walk to your nearest default standalone deployment and you’ll do yourself a favor.

The recommended method for migrating from a single-container deployment to a multi-container deployment is essentially:

  • Back up your discourse
  • Throw the whole thing away
  • Start over from scratch with a multi-container deployment
  • Restore your backup
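
In command form, that recommended path looks roughly like this. This is only a sketch: the backup and restore can just as well be done from the /admin/backups UI, and the backup filename below is a placeholder.

cd /var/discourse
./launcher enter app
discourse backup                  # or take and download a backup from /admin/backups
exit
./launcher destroy app
# create containers/data.yml and containers/web_only.yml as described below, then:
./launcher bootstrap data && ./launcher start data
./launcher bootstrap web_only && ./launcher start web_only
# copy your downloaded backup into the new shared backups directory
# (shared/web_only/backups/default/ with the volumes used in this guide), then:
./launcher enter web_only
discourse enable_restore
discourse restore <backup-filename>.tar.gz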

If, like me, you have a large site that takes hours to restore, you might wonder whether there is a faster way. Wonder no more! I migrated from a standalone deployment to a three-container web, data, and redis deployment in less time than it typically takes to ./launcher rebuild app for that site. (12 minutes of total downtime, when rebuilding the app has sometimes taken more than 30 minutes.)

If you do this, you are taking responsibility for knowing when you need to rebuild your other containers (data, and if you are silly like me and split out redis, redis as well). You will no longer get free upgrades of everything with ./launcher rebuild app — if you don’t have the resources to manage this process, use a standalone deployment, or purchase hosted Discourse.

Test

Do not use this process to migrate to multiple containers unless, after reading it, you also understand how it would let you migrate quickly from multiple containers back to a single container. If that is not obvious to you after reading this, then this post is sufficiently advanced technology (indistinguishable from magic), and you might not recognize it if this process breaks partway through, so you could end up with a broken Discourse that you don’t notice until much later. If that happens, you will get to keep both broken pieces. You break it, you buy it, as they say!

Backup

Back up first, and turn on thumbnail backups before you do, so that you don’t have to rebuild all the thumbnails on restore. If you make a mistake here, you can easily get into a situation where the easiest, safest, and fastest way to recover is to switch to the normal method. Be ready to fall back to the recommended method if anything goes wrong.

Download your backup. The commands below involve moving files around inside the Discourse data directory, and a mistake there could delete your backup, so download it. And if your backup doesn’t include uploads, back those up too; they also live where you’ll be moving files around.
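
If you prefer the command line to the admin UI for this, here is a sketch of pulling the latest backup and the uploads down to another machine, assuming the default paths and with your-server as a placeholder:

scp 'your-server:/var/discourse/shared/standalone/backups/default/*.tar.gz' ./
rsync -a your-server:/var/discourse/shared/standalone/uploads/ ./discourse-uploads/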

Seriously, back up.

When I did this, I first took a Discourse backup and then a remote system backup before going any further.

Set up new multi-container configuration

You will need at least containers/web_only.yml and containers/data.yml, and if you also want to split out redis, containers/redis.yml as well. Start by copying samples/data.yml (and optionally samples/redis.yml) to the containers/ directory.

If you are deploying redis separately, remove the redis template from the top of the containers/data.yml file.
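
Assuming a standard /var/discourse install, that copy step is just:

cd /var/discourse
cp samples/data.yml containers/data.yml
cp samples/redis.yml containers/redis.yml    # only if you are splitting out redis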

You have two ways to create web_only.yml.

  1. Copy samples/web_only.yml to containers/; then compare both of them to containers/app.yml, preserving any postgres configuration from params: in your new containers/data.yml
  • Copy any params: for postgres from containers/app.yml into containers/data.yml
  • Create a unique password to replace SOME_SECRET
  2. Alternatively, copy containers/app.yml to containers/web_only.yml and compare it to samples/web_only.yml:
  • Remove any references to the postgres and redis templates (see the sketch after this list)
  • Remove the entire params: section, which had only postgres settings
  • Add a links: section, verbatim from samples/web_only.yml, or modified (see below) if you are deploying redis in a separate container
  • Add a database section from samples/web_only.yml and create a unique password to replace SOME_SECRET
  • Change the volume definitions from standalone to web_only
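
For the second approach, the template references to remove are at the top of the file; here is a sketch of that deletion, assuming your app.yml uses the stock template list:

 templates:
-  - "templates/postgres.template.yml"
-  - "templates/redis.template.yml"
   - "templates/web.template.yml"
   - "templates/web.ratelimited.template.yml"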

Here is the links: section to use if you are splitting redis out into its own container instead of using the reasonable default of bundling it with postgres in the data container:

# Use 'links' key to link containers together, aka use Docker --link flag.
links:
  - link:
      name: data
      alias: data
  - link:
      name: redis
      alias: redis

Here is a copy of the current postgres settings in env in samples/data.yml that you will need to change SOME_SECRET in:

  ## TODO: configure connectivity to the databases
  DISCOURSE_DB_SOCKET: ''
  #DISCOURSE_DB_USERNAME: discourse
  DISCOURSE_DB_PASSWORD: SOME_SECRET
  DISCOURSE_DB_HOST: data
  DISCOURSE_REDIS_HOST: data

Note that for a normal deployment (not multisite) you will not need to modify any other lines. DISCOURSE_DB_SOCKET is for a Unix domain socket for Postgres; it’s not a port number.
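
If you do split redis into its own container, DISCOURSE_REDIS_HOST in web_only.yml should point at the redis container rather than at data; assuming the links: section above with alias: redis, that line becomes:

  DISCOURSE_REDIS_HOST: redis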

Here is an example of the change to the volumes definition at the end of web_only.yml that you will need if you create it from app.yml instead of from samples/web_only.yml:

@@ -75,10 +80,10 @@
 ## The Docker container is stateless; all data is stored in /shared
 volumes:
   - volume:
-      host: /var/discourse/shared/standalone
+      host: /var/discourse/shared/web_only
       guest: /shared
   - volume:
-      host: /var/discourse/shared/standalone/log/var-log
+      host: /var/discourse/shared/web_only/log/var-log
       guest: /var/log

Now, in containers/data.yml, replace SOME_SECRET with the same secret password you used in containers/web_only.yml.

Now you are ready for the migration.

Now is when you take, and download, your final backup before you try the fast migration. Remember, if anything goes wrong here, you immediately go to the recommended method. I can’t stress this enough.

cd /var/discourse

./launcher stop app
cd shared
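# create the new shared directories and move the existing postgres and redis data into them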
mkdir data
mkdir redis
mv standalone/postgres_* data/
mv standalone/redis_data/ redis/
mv standalone web_only
mkdir -p data/log/var-log
mkdir -p redis/log/var-log

cd ..

./launcher destroy app

./launcher bootstrap data
./launcher bootstrap redis
./launcher bootstrap web_only

./launcher start redis
./launcher start data
./launcher start web_only
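
Once the containers are up, it’s worth a quick sanity check before you call the migration done; these are standard docker and launcher commands:

docker ps                  # data, redis (if you split it out), and web_only should all be running
./launcher logs web_only   # watch the web container come up cleanly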

For me, on a 2-core VM with 4GB RAM and a site with 600MB backups that don’t include uploads, this process resulted in 12 minutes of downtime. Your mileage may vary.

Note that none of this so far updates the launcher, so you may not be up to date. (For example, I ran this after the postgres 12 update was available, but before I had applied it. This process left me with postgres 10. The very next thing I did was rebuild the data container, which updated the launcher and took me successfully through the postgres 12 update process.)

What to do on future updates

After this migration, if you need to update redis or data, you need to first stop the web app. This would look something like this:

./launcher stop web_only
./launcher rebuild data
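./launcher rebuild redis   # only if you split redis out and it also needs an update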
./launcher rebuild web_only

However, when neither postgres nor redis needs an update, it’s much faster not to have to rebuild those containers, and most app rebuilds are just ./launcher rebuild web_only.

Again, by moving to a multi-container deployment, keeping track of when that is appropriate is now your job. You’ll get notifications in the Admin console about updates, but they will apply only to the web_only container. Nothing will tell you when you need to update postgres or redis. If you do this, read the Announcements category before every version upgrade you do, and read the release notes for every new version you are upgrading to or through. That is, if you skip a version update, do not skip reading the release notes for the version you skipped updating to. (Consider setting up a watch on release notes, or subscribing your feed reader to https://meta.discourse.org/tag/release-notes.rss to stay up to date.)

Note that rebuilding the web_only container requires that the database be running, so you cannot speed things up by rebuilding all two or three containers in parallel. If you are going to rebuild them all every time, stick with the standard recommended standalone deployment; it will be faster than juggling multiple containers.

Backup Review

If you back up uploads separately from the database, I hope you have a file-based remote backup regime for uploads so that they can be restored in case of disaster.

Review your remote backup implementation to make sure that it will back up uploads in /var/discourse/shared/web_only instead of in /var/discourse/shared/standalone so that you keep your backups up to date in your new multi-container implementation.
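
For example, if your offsite backup is a simple rsync job (backup-host and the destination path here are placeholders), the only change is the source path:

# before: rsync -a /var/discourse/shared/standalone/uploads/ backup-host:/backups/discourse/uploads/
rsync -a /var/discourse/shared/web_only/uploads/ backup-host:/backups/discourse/uploads/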


I’ve never been able to start web_only after rebuilding data. Every time data is rebuilt, web_only needs to be rebuilt too. (It complains about a missing data container or something similar.)

Do this instead:

./launcher bootstrap web_only
./launcher destroy web_only && ./launcher start web_only

It’ll reduce downtime to ~15 seconds

Thanks! Of course my testing was when there were no changes, so I didn’t run into that. (Or maybe I did and just didn’t write it all down.) I’ve modified the writeup to follow this suggestion, thanks.

I didn’t realize that was understood to be always safe. Is it always unconditionally safe to do all the migrations while the old version is still live and accessing the database?

In any case, the effective downtime can be more than 15 seconds. I just did this on my test instance, and shortly after the migrations started during the bootstrap, I got this intermittent 500 Internal Server Error text:

Oops

The software powering this discussion forum encountered an unexpected problem. We apologize for the inconvenience.

Detailed information about the error was logged, and an automatic notification generated. We’ll take a look at it.

No further action is necessary. However, if the error condition persists, you can provide additional detail, including steps to reproduce the error, by posting a discussion topic in the site’s feedback category.

Overall, as a site admin, I think I’d rather have a slightly longer time giving 502 Bad Gateway while rebuilding web_only than an error message that encourages people to post to the feedback category for what’s actually an occasional normal thing.

I’m not a pro here, so maybe @pfaffman would have better advice. I’ve never experienced that behaviour on the dozen or so two-container installations that I’m managing. However, we’ve configured our Nginx reverse proxy to show a custom error page for all 50x-class errors, so my audience would probably never see that error page. I’ll disable the error page on a test site and share my findings.

You should configure an offline page. Besides, what do we gain if we’re still taking the site offline for more than a couple of seconds in production?

If you want to be sure that building a new container doesn’t hose the one that’s running, you need to know about Introducing Post Deployment Migration; otherwise, a few times a year there’s a database change that will break the running web server for the few minutes it takes to finish building the container.

You can play around with testing that by building a stable site and then upgrading to tests-passed.

If you have two containers, and you bootstrap a new one and then ./launcher destroy web_only; ./launcher start web_only, cranking up the new container takes about half a minute.

If you want zero downtime you need to leave the old container running while the new one cranks up, which takes some extra tooling.


Thanks for confirming that I’m not crazy.

We’re definitely going far past my original goal of describing how to migrate quickly from a single-container deployment to a multi-container deployment. I intended this topic to be specifically about faster migration, not about zero-downtime or near-zero-downtime strategies.

I’d like to leave my writeup as it is, with a safe-but-longer rebuild described as what to expect to do next, and ask that folks contribute/link to other detailed topics on downtime-limiting techniques. :slight_smile:


You’re welcome.

Indeed, but someone asked, and you were literally counting seconds. :wink:

Mostly, it’s “safe enough” to do a bootstrap of the web only container.
