Discourse upgrade via Web UI Fails & SSH Upgrade Brings Down Discourse Instance

NOTE: Original post updated 11/25/21 PM EST with new information

Notified of critical security updates to my Discourse installation, I attempted to update it using the Web UI (https://forum.legably.com/admin/upgrade) as I have done in the past. Two pieces of software needed to be upgraded: Docker Manager and Discourse.

The Docker Manager had to be upgraded first (the Discourse upgrade button was disabled). I started the Docker Manager upgrade using the Web UI and it completed successfully. I then started the Discourse upgrade but it failed midway through. When I refreshed the Web UI I saw the following message:

image

So, following the onscreen instructions, I SSH’d into the server, did a git pull and then ran sudo ./launcher rebuild app from the command line. The process finished but failed with a FAILED TO BOOTSTRAP error message.

Here is the output of two runs of sudo ./launcher rebuild app at different times:

The line numbers after each file are where the only ERRORs appear. Both appear to be database and role related (the two ranges differ because the second run attempted a git pull from the discourse/base repository).

2021-11-25 21:21:38.451 UTC [64] postgres@postgres ERROR:  database "discourse" already exists
2021-11-25 21:21:38.451 UTC [64] postgres@postgres STATEMENT:  CREATE DATABASE discourse;
createdb: error: database creation failed: ERROR:  database "discourse" already exists
I, [2021-11-25T21:21:38.454429 #1]  INFO -- :
I, [2021-11-25T21:21:38.454908 #1]  INFO -- : > su postgres -c 'psql discourse -c "create user discourse;"' || true
2021-11-25 21:21:38.531 UTC [68] postgres@discourse ERROR:  role "discourse" already exists
2021-11-25 21:21:38.531 UTC [68] postgres@discourse STATEMENT:  create user discourse;
ERROR:  role "discourse" already exists

This appears to dovetail with the FAILED error message displayed at the bottom of each Launcher Rebuild attempt.

FAILED
--------------------
Pups::ExecError: cd /var/www/discourse && su discourse -c 'bundle exec rake db:migrate' failed with return #<Process::Status: pid 436 exit 1>
Location of failure: /pups/lib/pups/exec_command.rb:112:in `spawn'
exec failed with the params {"cd"=>"$home", "hook"=>"db_migrate", "cmd"=>["su discourse -c 'bundle exec rake db:migrate'"]}
13bbdd52e0835ba9dfddc5c367d63b6087a16553c3a77d27ca307734d6e16907
** FAILED TO BOOTSTRAP ** please scroll up and look for earlier error messages, there may be more than one.
./discourse-doctor may help diagnose the problem.

Note: These ERRORS are not the root problem. See “Solution” below.

Some people below have said that there is an issue with redis that is preventing a successful rebuild.

I ran sudo ./discourse-doctor at various times during the day. Here is the output from two of the runs:

I verified that my Docker installation was running correctly by running sudo docker run -it --rm hello-world

Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
2db29710123e: Pull complete
Digest: sha256:cc15c5b292d8525effc0f89cb299f1804f3a725c8d05e158653a563f15e4f685
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

I ran sudo ./launcher cleanup to make sure I had enough disk space.

WARNING! This will remove all images without at least one container associated to them.
Are you sure you want to continue? [y/N] y
Deleted Images:
<DETAILS REMOVED>

Total reclaimed space: 3.836GB

$ df -hT /dev/xvda1
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/xvda1     ext4   30G  9.1G   20G  32% /

And I even checked my memory settings.

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           1.9G        304M        633M         20M        1.0G        1.5G
Swap:          2.0G          0B        2.0G
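
The disk and memory checks above can be scripted as a small preflight step. This is a minimal sketch: the function names and the 5 GB figure in the usage comment are illustrative assumptions, not values taken from Discourse or from this thread.

```shell
#!/bin/sh
# Preflight sketch: compare available disk and memory against thresholds.
# The thresholds are hypothetical; adjust them for your own server.

# Print the available space on the filesystem holding the given path, in KB.
avail_disk_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Print total RAM plus swap in KB (Linux only; reads /proc/meminfo).
total_mem_kb() {
  awk '/^MemTotal:|^SwapTotal:/ { sum += $2 } END { print sum + 0 }' /proc/meminfo
}

# Succeed when the first number is at least the second.
at_least() {
  [ "$1" -ge "$2" ]
}

# Usage (not run here):
#   at_least "$(avail_disk_kb /)" $((5 * 1024 * 1024)) || echo "low disk space"
```

Keeping the comparison in its own function makes it easy to reuse the same check for both disk and memory.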

A reboot of the server did not solve the issue but I did notice something interesting after rebooting the server.

The Docker app container is running after a reboot.

$ sudo docker ps
CONTAINER ID   IMAGE                 COMMAND        CREATED       STATUS          PORTS                                                                      NAMES
6449ec0061a0   local_discourse/app   "/sbin/boot"   7 weeks ago   Up 25 seconds   0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp   app

But when I go to the site I get a “502 Bad Gateway” error.

When I stop the app container and go to the site I get an “Unable To Connect” error (which seems right since the container isn’t running).

But this puzzles me since I don’t have Nginx installed on this server.

I can see in the Rebuild output where the process is copying Nginx files from one location to another but I cannot find the corresponding directories or files, specifically nginx.conf on my server anywhere. Ubuntu, Docker, and Discourse are not my primary skills but I am assuming that these files are being copied “within” the Docker app container.

Thanks in advance; appreciate any additional help or direction with this issue, which seems to surface from time to time during Discourse upgrades.

UPDATE: It turns out my assumption regarding the Docker app container having its own internal filesystem is correct. You can create a snapshot of the container filesystem and explore this filesystem using bash.

# create image (snapshot) from container filesystem
$ sudo docker commit <container_id> mysnapshot
$ sudo docker run -t -i mysnapshot /bin/bash

In the app filesystem there is an nginx directory that contains a Discourse configuration file.

root@f91826d986eb:/etc/nginx/conf.d# ls -l
total 12
-rw-r--r-- 1 root root 10568 Oct  3 21:33 discourse.conf
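
For a container that is still running, the same files can be inspected without taking a snapshot. The sketch below uses docker exec, a standard Docker command; the container name app comes from the docker ps output above, while the run wrapper and the DRY_RUN variable are hypothetical helpers added here so the commands can be previewed without executing them.

```shell
#!/bin/sh
# Sketch: look inside the running "app" container without committing a snapshot.
# DRY_RUN and run() are illustrative helpers, not part of Docker or Discourse.

# Print the command when DRY_RUN is set; otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# List the Nginx configuration directory inside the container.
list_nginx_conf() {
  run sudo docker exec "${1:-app}" ls -l /etc/nginx/conf.d
}

# Open an interactive shell inside the container.
enter_container() {
  run sudo docker exec -it "${1:-app}" /bin/bash
}

# Usage (not run here):
#   DRY_RUN=1; list_nginx_conf    # preview the command only
```

The launcher script also offers an enter command (./launcher enter app) that serves the same purpose.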

How about restart? From here

This update requires Docker to be restarted.

So yes, your Discourse is gonna be offline while you run ‘./launcher rebuild app’ in the command line.

All will be 100% once the rebuild is completed.

@IAmGav Docker appears to be running.

@rmccown image

I will try some of the other launcher commands to see if I can get things up and running.

@IAmGav Ran ./discourse-doctor and it also confirmed that the Docker container is running.

==================== DOCKER INFO ====================
DOCKER VERSION: Docker version 20.10.11, build dea9396

DOCKER PROCESSES (docker ps -a)

CONTAINER ID   IMAGE                 COMMAND        CREATED       STATUS             PORTS                                                                      NAMES
6449ec0061a0   local_discourse/app   "/sbin/boot"   7 weeks ago   Up About an hour   0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp   app


Discourse container app is running


@rmccown Tried both start and restart without luck.

Still get this at the URL.

image

Running sudo ./launcher rebuild app again fails with these messages (currently looking through the file for earlier error messages).

When all else fails - have you tried unplugging it and plugging back in? :wink:

Reboot the instance: reboot now

Bring the instance up to latest: apt-get update and apt-get dist-upgrade

Then run the Discourse upgrade.

@omarfilip That’s the first thing I tried :wink:

I just tried a reboot again (there were no Ubuntu 18.04.6 components left to upgrade after the earlier upgrade).

Same results - 502 Bad Gateway error at URL.

But thanks for the suggestion.

Also just found this from November 2020 - Upgrade ends with FAILED TO BOOTSTRAP.

But I have to admit I don’t know exactly what it means to “follow our default release channel.” I assume that upgrading via the Web UI is the “default release channel.”

That would be “tests-passed.” If you have not modified this line in your app.yml file, then you are on the default release channel:

  ## Which Git revision should this container use? (default: tests-passed)
  #version: tests-passed
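
One way to check this from the command line is to look for an uncommented version line. A minimal sketch, assuming the file uses the layout shown above; the release_channel function name and the path in the usage comment are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: report which Git revision an app.yml requests.
# Prints "tests-passed" (the default named in the file's own comment)
# when no uncommented version line is present.
release_channel() {
  v=$(sed -nE 's/^[[:space:]]*version:[[:space:]]*([^[:space:]#]+).*$/\1/p' "$1" | head -n 1)
  echo "${v:-tests-passed}"
}

# Usage (path is an assumption based on a standard install):
#   release_channel /var/discourse/containers/app.yml
```

A commented-out line such as the one above starts with # and is skipped, so the function falls back to the default.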

Can you share the whole log? This is not enough to see the actual error.

I am using the default release channel.

And my db_shared_buffers value is not 0MB (found that issue here).

@IAmGav Here is a gist that contains the ./launcher rebuild app output - link

The only indication of ERROR is in the following group of lines (see below).

They seem to be related to database and role creation. Is there any way to bypass these actions (which I think you would want to do during an upgrade, since you are working with a pre-existing instance)?

Lines 88-95

2021-11-25 21:21:38.451 UTC [64] postgres@postgres ERROR:  database "discourse" already exists
2021-11-25 21:21:38.451 UTC [64] postgres@postgres STATEMENT:  CREATE DATABASE discourse;
createdb: error: database creation failed: ERROR:  database "discourse" already exists
I, [2021-11-25T21:21:38.454429 #1]  INFO -- :
I, [2021-11-25T21:21:38.454908 #1]  INFO -- : > su postgres -c 'psql discourse -c "create user discourse;"' || true
2021-11-25 21:21:38.531 UTC [68] postgres@discourse ERROR:  role "discourse" already exists
2021-11-25 21:21:38.531 UTC [68] postgres@discourse STATEMENT:  create user discourse;
ERROR:  role "discourse" already exists

Literally the exact same thing happened to me a while back: an update failed in the middle of a Web UI update. Even the error messages after an attempted rebuild are pretty much the same, and it’s still unresolved. The only difference is that I could still update via the Web UI for a while; a day or two ago, after not updating for two or so weeks, it started showing the “You are running an old version of the Discourse image” notice, and now I can’t update at all. :upside_down_face:

Apparently it’s an issue with redis.

Everyone - Updated my original post with information gathered throughout the day. Thanks everyone who has posted so far.

No. There is another error. Try to remove that plugin.

Gem::ConflictError: Unable to activate omniauth-vkontakte-1.7.0, because omniauth-oauth2-1.7.2 conflicts with omniauth-oauth2 (>= 1.5, <= 1.7.1)
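
A line like this is easy to miss among the harmless “already exists” messages, which come from setup commands such as the create user step that the log shows followed by || true. A minimal sketch for narrowing a saved rebuild log down to candidate failures; the rebuild.log filename and the real_errors function name are assumptions, and the pattern list is a starting point rather than an exhaustive filter.

```shell
#!/bin/sh
# Sketch: show likely failure lines from a saved rebuild log, skipping the
# benign "already exists" messages. The log would first be captured with e.g.
#   sudo ./launcher rebuild app 2>&1 | tee rebuild.log
real_errors() {
  grep -nE 'ERROR|Error|FAILED' "$1" | grep -v 'already exists'
}

# Usage (not run here):
#   real_errors rebuild.log
```

Each match is prefixed with its line number in the log, which makes it easier to jump to the surrounding context.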

Following Michael’s advice in the above post, I commented out a plug-in in the app.yml file that was from an initial attempt at SSO authentication using the VK plug-in (we never went with this implementation but obviously forgot to remove the plug-in from the app.yml file).

## Plugins go here
## see https://meta.discourse.org/t/19157 for details
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://github.com/discourse/docker_manager.git
##          - git clone https://github.com/discourse/discourse-vk-auth.git

After commenting out the line above I ran sudo ./launcher rebuild app again. After the rebuild the forum site appears to be up and running (testing now).

Thanks again to everyone who took the time to review my posts and comment. Your help was very much appreciated (what better way to spend the Thanksgiving holiday here in the US :wink:).

Stay safe.