I’m trying to wrap my head around a “zero downtime” configuration. My current setup has a couple of Discourse instances for different communities. Both use the two-container (data/web) configuration. Nginx at the host level handles SSL termination and hands requests off to each container’s Nginx over a socket.
Have found these two topics of interest:
So I’m trying to understand this process. There seems to be a bit of assumed knowledge for accomplishing this. Any help someone could give here would be great.
The first thing I’d like to understand is how to know when a data-container needs to be upgraded. It seems there are instances where you can’t just rebuild the web-container. How do I know with certainty when this is the case? Would this be all cases where the upgrade option is greyed out in the admin UI panel for upgrades, along with potentially custom work with themes and plugins? Would I be able to know with certainty by looking through database schema migrations? Or would I need to keep a staging environment around and just attempt the upgrade there to know with certainty?
The next thing I’d like to know is how to run a zero-downtime upgrade. The way the two links read to me, you would be doing a rebuild of the data-container and web-container anyways? I’m not able to decipher this. Would I need separate data/web containers in order to accomplish zero-downtime in the end?
Any guidance would be awesome! Probably I could spend a lot of hours and figure out something that appeared to work, but would rather stand on the shoulders of giants and not have to find out edge-cases the hard way (in production) if at all possible.
If you require any more information on my particular setup, please ask for clarification. I will reply directly and update this post.
I don’t know much about “post deployment migration”, but from what I remember reading here on meta, one way to achieve this is with 3 containers: 1 data and 2 web. You update whichever web container is not currently serving traffic, then switch over to it once the update is done.
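A rough sketch of that swap, in case it helps to see it as commands. The container names `web_a` and `web_b` are hypothetical (they would correspond to `containers/web_a.yml` and `containers/web_b.yml`), and the `DRY_RUN` switch and function names are my own additions, not part of `launcher`. It assumes you run it from the directory containing `./launcher` and that each web container exposes its own socket or port. Note that bootstrapping a web container normally runs database migrations against the shared data container, so this does not remove the migration risk discussed in this topic.

```shell
#!/usr/bin/env bash
# Sketch only: swap between two web containers sharing one data container.
# ASSUMPTION: web_a / web_b are hypothetical container names.
set -eu

# DRY_RUN=1 (the default here) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Given the container currently serving traffic, print the idle one.
idle_of() {
  case "$1" in
    web_a) echo web_b ;;
    web_b) echo web_a ;;
    *) echo "unknown container: $1" >&2; return 1 ;;
  esac
}

# Step 1: build and start the idle container while the live one keeps serving.
prepare_idle() {
  local idle
  idle="$(idle_of "$1")"
  run ./launcher bootstrap "$idle"
  run ./launcher destroy "$idle"
  run ./launcher start "$idle"
}

# Step 2 (manual): point the host Nginx at the idle container and reload it.

# Step 3: stop the container that used to be live.
retire_old() {
  run ./launcher stop "$1"
}
```

With `web_a` live, `prepare_idle web_a` prints (or, with `DRY_RUN=0`, runs) the commands for `web_b`. After switching the host Nginx over, `retire_old web_a` stops the old container.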
I think this makes sense. So the data container does not run a rebuild via launcher? I would just handle load balancing via the host level Nginx. Let me see if I can sort this:
data container:
./launcher enter data_container
SKIP_POST_DEPLOYMENT_MIGRATIONS=1 rake db:migrate
It all depends on what updates need to be made to the data container.
The Postgres 12 update is a good example of unavoidable downtime. Even if you had a duplicate data container you would need to run your dupe site read-only while the database upgrade happened.
The only way to never have downtime is to never upgrade. Updates via /admin/upgrade on a single container install are already zero downtime. Updates made via ssh, such as when the base image needs updating, will have varying degrees of downtime depending on your budget and appetite for complexity.
The best way to avoid downtime is to build a staging copy. Otherwise you run a small risk of downtime after every update, when plugins or customizations encounter compatibility issues.
Okay. So to ensure I would not have a major issue, do a run in staging and observe results… so I would attempt the above procedure and see if the data container failed?
If so my staging could consist of 1 data, 1 web and production would be 2 data, 2 web. Worst case scenario if attempted first in staging would be to in production:
If users can’t access the site that’s downtime, obviously.
If they can’t register, post, reply or like would you consider that to be downtime?
If this is a large community the costs of running multiple data containers on ssd will be considerable. Have you considered an external postgres server such as Amazon RDS?
The kind of details that @Stephen is pointing out are really important, because we need to understand what zero downtime means. For example, I could hack a zero-downtime requirement by doing the following:
I define zero downtime as never responding to the user with a code other than HTTP 200 when the request is valid (keeping 3xx and 4xx responses open as needed). Then I deploy Discourse on a $10 droplet as a one-container install and follow “Adding an offline page when rebuilding” so that users never see a 500 error. This way I never show a site that is down.
Would I, in a rational mindset, think that this is zero downtime? Never. Does it work as proposed? Of course. And I could go ahead and add a standby server in another region to be even more zero-downtime proof.
That is why qualification and semantics are important. Always showing something is not the same as always having a functional site.
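To make the distinction concrete, the narrow “no 5xx for valid requests” definition above could be measured with something like the sketch below. The function names are hypothetical and mine. The point is that an offline page served with status 200 would pass this check even though nobody can post, while a maintenance page served with 503 would fail it.

```shell
#!/usr/bin/env bash
# Sketch only: measures the narrow "no 5xx" definition, nothing more.

# count_5xx: reads one HTTP status code per line on stdin,
# prints how many of them are in the 500-599 range.
count_5xx() {
  awk '$1 >= 500 && $1 <= 599 { n++ } END { print n + 0 }'
}

# zero_downtime_by_status: succeeds only if no 5xx code appears on stdin.
zero_downtime_by_status() {
  [ "$(count_5xx)" -eq 0 ]
}
```

Assuming the host Nginx writes the default combined log format, where the status code is the ninth whitespace-separated field, something like `awk '{print $9}' access.log | count_5xx` would feed it real data. It says nothing about whether users could register, post, reply or like during that window.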
Just help us understand. What do you need in order to meet your definition of zero downtime? Can users tolerate 10-30 minutes of read-only? Are you savvy enough to hack a solution together? Are you looking to show your users a nice page that says “Under maintenance, be right back”? We need more details to give you a more accurate solution that really works for you, or at least to point you in the right direction.
This discussion was getting a bit heated and off topic. Please remember to be respectful to each other while discussing a topic. Clarification questions are asked to give a more precise answer, as everyone’s setup and goals are different.