Is this process safe? I can run a multi-container setup just fine in a dev environment, but if I start using it in production, while people are accessing the old container and the new container bootstraps and runs the db migration step, requests to the old container will still use the old backend logic and save data as defined by the previous version, even after the db migration step ends (but before the entire bootstrap process is done).
I know this is not a problem specific to Discourse itself (an environment with several replicas could have the same problem if one replica is updated before the others, unless you stop all of them before upgrading, which you probably won’t do if you want HA), but is the process you described still safe, generally speaking?
One thing I can think of is making sure to always keep Discourse up to date, so that the db migrations between rebuilds are as small as possible. But even then this is not ideal, and problems could still arise.
The multicontainer setup seems like one of the recommended approaches (although not the standard one with just 1 container), so I think it should be safe, and I’m just overthinking.
Do you know if it works fine on production sites (bootstrapping one container while another is running)? I’m asking to hear from people who have already done this in production, to know whether it keeps working even after several rebuilds, whether there are gotchas, etc. Like I said, in a dev environment it works fine.
If you want zero downtime, there’s a couple extra steps that need to be done - disable “post-migrations” on the new containers, roll out the new container fully, enable post-migrations, then roll out again. This will stop any column-dropping migrations from executing until the old code is no longer running.
There’s no howto for this yet, it’s only documented here:
For the most part, though, most forums can live with a minute or three of downtime every month.
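As a rough sketch only (the container names web_a and web_b here are made up, and the yml change is the SKIP_POST_DEPLOYMENT_MIGRATIONS setting mentioned further down this topic), the two-phase rollout looks something like:

# Phase 1: deploy the new code with post-deployment (column-dropping) migrations deferred.
# First add SKIP_POST_DEPLOYMENT_MIGRATIONS: 1 under the env: section of each web container yml.
cd /var/discourse
./launcher rebuild web_a   # the old code on web_b keeps serving requests meanwhile
./launcher rebuild web_b

# Phase 2: once no old code is running anywhere, remove that env setting and roll out again
# so the deferred migrations finally run.
./launcher rebuild web_a
./launcher rebuild web_b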
Then we select the container we want to be live with a symbolic link to the actual socket, like this:
Say we want the socket2 container to be live:
ln -sf /var/discourse/shared/socket2/nginx.http.sock /var/run/nginx.http.sock
Say we want to make a change on socket1 and make socket1 live:
cd /var/discourse
./launcher rebuild socket1
ln -sf /var/discourse/shared/socket1/nginx.http.sock /var/run/nginx.http.sock
Notice that there is no need to only bootstrap (rather than fully rebuild) the socket1 container, because each container is exposed via a unix domain socket in its own shared directory / volume, so both of these “web app” containers can run at the same time.
There is no “port binding collision” like when a TCP/IP container port is exposed. For this reason, I only expose a unix domain socket in production (not a TCP/IP port).
Of course, you can bootstrap if you want:
cd /var/discourse
./launcher bootstrap socket1
./launcher start socket1
ln -sf /var/discourse/shared/socket1/nginx.http.sock /var/run/nginx.http.sock
It’s up to you, but keep in mind that if you run both containers at the same time, both will run sidekiq and scheduled jobs (at least that has been our experience), so we occasionally sync our uploads between the two containers.
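For what it’s worth, that occasional uploads sync can be as simple as a one-way rsync between the two shared volumes (a sketch using the socket1/socket2 layout above; the exact paths depend on how your volumes are mounted):

# Copy uploads written by the currently-live container (socket1 here) to the standby’s volume.
rsync -a /var/discourse/shared/socket1/uploads/ /var/discourse/shared/socket2/uploads/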
This works flawlessly for us and we can rebuild a web app container and make it live with basically zero downtime. We take downtime very seriously in production and avoid it whenever we can.
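Put together, the rebuild-and-switch can be wrapped in a tiny helper; this is purely a sketch of the commands already shown above (nothing beyond launcher and ln):

#!/bin/bash
# switch-live.sh <socket1|socket2>: rebuild one web container and point the
# host nginx socket symlink at it. The other container keeps serving meanwhile.
set -e
TARGET="$1"
cd /var/discourse
./launcher rebuild "$TARGET"
ln -sf "/var/discourse/shared/$TARGET/nginx.http.sock" /var/run/nginx.http.sock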
Note:
This method (above) is designed for the web app part of the solution, not for the data container. I have not created a similar solution for data; but who knows, maybe someday I’ll spend some time and build something similar (but different, of course) for the data container (some kind of “two data containers, sync the DBs” method, totally TBD at this point in my mind).
So, I am actually doing the opposite of this:
Like I said, in a dev environment it works fine.
I generally do not set this up in dev because it takes longer to set up and is not necessary there: a little downtime is fine when it’s just “me and the code” (not live with users and bots hitting the site), and besides, I do not use docker in development** (on the desktop).
Hope this helps.
**By “development”, I mean software (for example plugin) development; not just simply “staging” a Discourse install, which I refer to as “staging” not “development” (just to be quite clear).
Thanks, I didn’t know about that. It’s a great feature and I can see it avoiding problems like the one I mentioned before (although it won’t help when the new code changes the logic around already-existing columns; that should be a rarer case, though).
Thanks for the answer. SKIP_POST_DEPLOYMENT_MIGRATIONS seems to be what @riking mentioned and exactly what I was after, to keep migrations from breaking things in the already-running container.
Thanks for your explanation. That seems like a good approach to me, alternating between two containers (so you can keep one container running while the other bootstraps).
When I said dev environment, I meant a remote development environment that I used to test the multi-container setup (both with the containers on the same machine and with them on different machines).
I said dev and not staging because in a staging environment I would use yml files with the same plugins and a backup of the production database for testing. But true, if I just wanted to set up a dev environment, in most cases I would use the one-container approach.
As far as I can tell, this guide is lots of words around:
back up
create a completely new discourse instance, with more words but the same results as just running discourse_setup 2container
restore
Why wouldn’t you just move or copy /var/discourse/shared/standalone/{postgres,redis}* into /var/discourse/shared/data after a clean shutdown and before starting up two new containers from separate containers/*.yml files? A backup/restore seems like a really heavy-weight way to move all that data across, adding hours unnecessarily to the process. Am I missing something obvious here?
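For concreteness, the shortcut I’m describing is roughly this (a sketch only, assuming containers/data.yml and containers/web_only.yml have already been written and the data container points at shared/data; keep a backup around regardless):

cd /var/discourse
./launcher stop app                                   # clean shutdown of the old standalone container
mkdir -p shared/data
mv shared/standalone/{postgres,redis}* shared/data/   # move the postgres and redis state across
./launcher rebuild data                               # bring up the new data container on the moved files
./launcher rebuild web_only                           # then the web container that talks to it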
I just tested this process on my test discourse, and split out redis too as long as I was at it, just to make sure I was covering all the bases. Edit: I’ve moved the description to a new topic:
The site seems to be functioning fine without a backup/restore cycle. Is there something non-obvious I should be checking for?
I did the same process for a relatively large discourse and it’s working fine. I decided that in production I would name my new web_only container app so that my fingers keep doing the right thing out of habit. After I wrote the new container/*.yml files, the downtime for the entire migration was 12 minutes, far faster than it would have been for a backup/restore cycle.
I guess we just agree to disagree then. I’d think that if someone is competent to run multiple containers, they can run a few commands, and I don’t think those commands are harder to type than the bin/rails c invocations and ad-hoc ruby scattered all over here, or that they require meaningfully more or different skills than running multiple containers at all. But I’ll migrate the content to a separate new post rather than leaving it buried in a comment here.
There’s the flaw in your argument. Adding 2container to ./discourse-setup doesn’t imply any measure of competency. There are a lot of people who run two containers simply because they see topics like this and assume there’s some secret sauce, or that it’s the “thing to do”.
The postgres 12 topic should serve as a cautionary tale when it comes to the added complexity. Using a backup as the step between states allows a user to revert to a single container by renaming a single file; once you start moving folders, that simplicity is lost.
@Stephen There’s a flaw in your argument: the multi-container description is full of warnings that you have to take responsibility for updates and understand how it works, and the long description above is so obfuscated that probably anyone who looked at it would give up anyway. Go read my How to migrate quickly to separate web and data containers and tell me that it won’t scare away the people who would have trouble following it, or that it fails to emphasize the necessity of a backup and the ability to fall back to it if anything goes wrong!
I was deeply unhappy when I hit ./launcher rebuild app shortly after migrating to a more capable server (for a security fix) and had my site down for an egregiously long time, much of which was spent rebuilding the postgres parts of the container. That was when I found the 2container documentation and this documentation, and I really didn’t want to take another 4-hour downtime to migrate, so I kept taking long downtimes for ./launcher rebuild app to avoid the 4 hours of downtime a restore would take. As a vaguely competent person, I was very annoyed for a long time that this configuration was effectively hidden.
The postgres 12 topic is a great reference, because people there end up with more downtime since they have to rebuild the entire app multiple times, when they could be rebuilding only the postgres container twice. I can’t say I’ve read the entire thread due to the 6-day auto-delete thing, but it’s not at all obvious to me that incompetent multi-container deploys are the, or even a, big problem there.
(Sorry, sometimes I get a little tired of the “all users are incompetent” around here.)
It may not make sense to you, but for those of us who have been here on meta for 6/7 years offering assistance to people in support it will always make sense to have a rollback strategy.
Bugs do make their way into tests-passed, occasionally rubygems rate-limits impact rebuilds and even GitHub has had the odd wobble. For that reason alone I don’t see any value in making a state change which makes it more difficult to simply rename a file and ./launcher start app.
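Concretely, the rollback being protected here is just this (the .disabled filename is only an example of however you renamed the old yml):

cd /var/discourse
mv containers/app.yml.disabled containers/app.yml   # put the old single-container yml back
./launcher start app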
Your appetite for risk may be different, in which case you can take a different route. For those who regularly help pick up the pieces the current guide works well.
I don’t feel like you actually read the process I wrote, since you write as if I didn’t emphasize the need for the ability to do the restore. Please go read it and then come back and consider editing what you wrote to be a truthful statement regarding my instructions. As it is, I feel like you are 'splaining at me without doing me the courtesy of bothering to read what I wrote.
However, I have added many more warnings, including one at the top, in excess of the warnings in this post, which has been here for five years and continues to be the canonical location for instructions on this migration process.
First of all, thank you for trying to clear this “jungle” for us adventurers; I’ll probably be following in your steps in a few days…
What is in the multisite.yml file?
I recently setup a self-hosted Discourse forum on a VPS. Uploads are being stored on Wasabi. Same for backups. Everything is hosted at Linode.
I used the standalone template and found the setup to be a breath of fresh air compared to other software. It was sublime! A pure joy. I wish every open source project paid as much attention to install and setup as Discourse did.
Here’s the thing though. I have a dedicated PostgreSQL cluster running on a separate server, only accessible via an RFC-1918 address that isn’t reachable from the Internet, which I’d like Discourse to use. I’m not a huge fan of database servers running on the same server as the web/application server.
So is there any way to separate the standalone database and move that over to my dedicated database cluster?
I’m assuming all I’ll need to do is a pg_dump of the Discourse database, move it over to my dedicated database server and restore it, run a VACUUM ANALYZE on all tables after the restore/import, and then just point the Discourse app at the new database across the wire?
But I can’t seem to find where the database credentials are stored. I looked in app.yml but there don’t appear to be any database entries, and when I looked in the …/templates/ folder, none of the yml files had any database credentials either, as far as I can tell.
If there are normally no credentials for the built-in postgres database, is there a way to run pg_dump and dump the database from the standalone container?
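I’m guessing something along these lines would work from inside the standalone container (just a sketch of what I have in mind, assuming the stock discourse database name):

# On the host: open a shell inside the running standalone container.
cd /var/discourse
./launcher enter app

# Inside the container: dump the stock "discourse" database. /shared maps to
# /var/discourse/shared/standalone on the host, so the file survives the container.
su postgres -c 'pg_dump --format=custom discourse' > /shared/discourse.dump
exit

# Then copy the dump to the dedicated server, restore it with pg_restore,
# and run VACUUM ANALYZE as planned above.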