Upgrade failed due to unclean database shutdown


(ljpp) #1

I was doing some routine on my box and also decided upgrade Discourse via ./launcher rebuild. Surprisingly, it failed. Never seen this before. I am on stable branch, so the update was from 1.6.7 to 1.6.9.

./launcher rebuild app
Ensuring launcher is up to date
Fetching origin
Launcher is up-to-date
Stopping old container
+ /usr/bin/docker stop -t 10 app
app
cd /pups && git pull && /pups/bin/pups --stdin
Already up-to-date.
I, [2016-12-17T09:18:04.758572 #14]  INFO -- : Loading --stdin
I, [2016-12-17T09:18:04.784722 #14]  INFO -- : > mkdir -p /shared/postgres_run
I, [2016-12-17T09:18:04.791334 #14]  INFO -- : 
I, [2016-12-17T09:18:04.792137 #14]  INFO -- : > chown postgres:postgres /shared/postgres_run
I, [2016-12-17T09:18:04.798552 #14]  INFO -- : 
I, [2016-12-17T09:18:04.799377 #14]  INFO -- : > chmod 775 /shared/postgres_run
I, [2016-12-17T09:18:04.803513 #14]  INFO -- : 
I, [2016-12-17T09:18:04.804269 #14]  INFO -- : > rm -fr /var/run/postgresql
I, [2016-12-17T09:18:04.808721 #14]  INFO -- : 
I, [2016-12-17T09:18:04.809443 #14]  INFO -- : > ln -s /shared/postgres_run /var/run/postgresql
I, [2016-12-17T09:18:04.813831 #14]  INFO -- : 
I, [2016-12-17T09:18:04.814460 #14]  INFO -- : > socat /dev/null UNIX-CONNECT:/shared/postgres_run/.s.PGSQL.5432 || exit 0 && echo postgres already running stop container ; exit 1
2016/12/17 09:18:04 socat[22] E connect(6, AF=1 "/shared/postgres_run/.s.PGSQL.5432", 36): Connection refused
I, [2016-12-17T09:18:04.822875 #14]  INFO -- : 
I, [2016-12-17T09:18:04.823462 #14]  INFO -- : > rm -fr /shared/postgres_run/.s*
I, [2016-12-17T09:18:04.829064 #14]  INFO -- : 
I, [2016-12-17T09:18:04.829724 #14]  INFO -- : > rm -fr /shared/postgres_run/*.pid
I, [2016-12-17T09:18:04.835050 #14]  INFO -- : 
I, [2016-12-17T09:18:04.835530 #14]  INFO -- : > mkdir -p /shared/postgres_run/9.5-main.pg_stat_tmp
I, [2016-12-17T09:18:04.840017 #14]  INFO -- : 
I, [2016-12-17T09:18:04.840702 #14]  INFO -- : > chown postgres:postgres /shared/postgres_run/9.5-main.pg_stat_tmp
I, [2016-12-17T09:18:04.845316 #14]  INFO -- : 
I, [2016-12-17T09:18:04.854594 #14]  INFO -- : File > /etc/service/postgres/run  chmod: +x
I, [2016-12-17T09:18:04.863172 #14]  INFO -- : File > /etc/runit/3.d/99-postgres  chmod: +x
I, [2016-12-17T09:18:04.871895 #14]  INFO -- : File > /root/upgrade_postgres  chmod: +x
I, [2016-12-17T09:18:04.872971 #14]  INFO -- : > chown -R root /var/lib/postgresql/9.5/main
I, [2016-12-17T09:18:05.067071 #14]  INFO -- : 
I, [2016-12-17T09:18:05.067390 #14]  INFO -- : > [ ! -e /shared/postgres_data ] && install -d -m 0755 -o postgres -g postgres /shared/postgres_data && sudo -E -u postgres /usr/lib/postgresql/9.5/bin/initdb -D /shared/postgres_data || exit 0
I, [2016-12-17T09:18:05.072457 #14]  INFO -- : 
I, [2016-12-17T09:18:05.073063 #14]  INFO -- : > chown -R postgres:postgres /shared/postgres_data
I, [2016-12-17T09:18:05.094604 #14]  INFO -- : 
I, [2016-12-17T09:18:05.095566 #14]  INFO -- : > chown -R postgres:postgres /var/run/postgresql
I, [2016-12-17T09:18:05.101806 #14]  INFO -- : 
I, [2016-12-17T09:18:05.102894 #14]  INFO -- : > /root/upgrade_postgres
I, [2016-12-17T09:18:05.112732 #14]  INFO -- : 
I, [2016-12-17T09:18:05.113478 #14]  INFO -- : > rm /root/upgrade_postgres
I, [2016-12-17T09:18:05.117602 #14]  INFO -- : 
I, [2016-12-17T09:18:05.118501 #14]  INFO -- : Replacing data_directory = '/var/lib/postgresql/9.5/main' with data_directory = '/shared/postgres_data' in /etc/postgresql/9.5/main/postgresql.conf
I, [2016-12-17T09:18:05.120126 #14]  INFO -- : Replacing (?-mix:#?listen_addresses *=.*) with listen_addresses = '*' in /etc/postgresql/9.5/main/postgresql.conf
I, [2016-12-17T09:18:05.121516 #14]  INFO -- : Replacing (?-mix:#?synchronous_commit *=.*) with synchronous_commit = $db_synchronous_commit in /etc/postgresql/9.5/main/postgresql.conf
I, [2016-12-17T09:18:05.122976 #14]  INFO -- : Replacing (?-mix:#?shared_buffers *=.*) with shared_buffers = $db_shared_buffers in /etc/postgresql/9.5/main/postgresql.conf
I, [2016-12-17T09:18:05.124371 #14]  INFO -- : Replacing (?-mix:#?work_mem *=.*) with work_mem = $db_work_mem in /etc/postgresql/9.5/main/postgresql.conf
I, [2016-12-17T09:18:05.125726 #14]  INFO -- : Replacing (?-mix:#?default_text_search_config *=.*) with default_text_search_config = '$db_default_text_search_config' in /etc/postgresql/9.5/main/postgresql.conf
I, [2016-12-17T09:18:05.126940 #14]  INFO -- : > install -d -m 0755 -o postgres -g postgres /shared/postgres_backup
I, [2016-12-17T09:18:05.132320 #14]  INFO -- : 
I, [2016-12-17T09:18:05.133328 #14]  INFO -- : Replacing (?-mix:#?max_wal_senders *=.*) with max_wal_senders = $db_max_wal_senders in /etc/postgresql/9.5/main/postgresql.conf
I, [2016-12-17T09:18:05.134681 #14]  INFO -- : Replacing (?-mix:#?wal_level *=.*) with wal_level = $db_wal_level in /etc/postgresql/9.5/main/postgresql.conf
I, [2016-12-17T09:18:05.136064 #14]  INFO -- : Replacing (?-mix:#?checkpoint_segments *=.*) with checkpoint_segments = $db_checkpoint_segments in /etc/postgresql/9.5/main/postgresql.conf
I, [2016-12-17T09:18:05.137345 #14]  INFO -- : Replacing (?-mix:#?logging_collector *=.*) with logging_collector = $db_logging_collector in /etc/postgresql/9.5/main/postgresql.conf
I, [2016-12-17T09:18:05.138644 #14]  INFO -- : Replacing (?-mix:#?log_min_duration_statement *=.*) with log_min_duration_statement = $db_log_min_duration_statement in /etc/postgresql/9.5/main/postgresql.conf
I, [2016-12-17T09:18:05.139813 #14]  INFO -- : Replacing (?-mix:^#local +replication +postgres +peer$) with local replication postgres  peer in /etc/postgresql/9.5/main/pg_hba.conf
I, [2016-12-17T09:18:05.140959 #14]  INFO -- : Replacing (?-mix:^host.*all.*all.*127.*$) with host all all 0.0.0.0/0 md5 in /etc/postgresql/9.5/main/pg_hba.conf
I, [2016-12-17T09:18:05.141949 #14]  INFO -- : > HOME=/var/lib/postgresql USER=postgres exec chpst -u postgres:postgres:ssl-cert -U postgres:postgres:ssl-cert /usr/lib/postgresql/9.5/bin/postmaster -D /etc/postgresql/9.5/main
I, [2016-12-17T09:18:05.145214 #14]  INFO -- : > sleep 5
2016-12-17 09:18:05 UTC [46-1] LOG:  database system was interrupted; last known up at 2016-12-17 08:57:21 UTC
I, [2016-12-17T09:18:10.149505 #14]  INFO -- : 
I, [2016-12-17T09:18:10.150408 #14]  INFO -- : > su postgres -c 'createdb discourse' || true
2016-12-17 09:18:10 UTC [50-1] postgres@postgres FATAL:  the database system is starting up
2016-12-17 09:18:10 UTC [51-1] postgres@template1 FATAL:  the database system is starting up
createdb: could not connect to database template1: FATAL:  the database system is starting up
I, [2016-12-17T09:18:10.283260 #14]  INFO -- : 
I, [2016-12-17T09:18:10.284283 #14]  INFO -- : > su postgres -c 'psql discourse -c "create user discourse;"' || true
2016-12-17 09:18:10 UTC [62-1] postgres@discourse FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
I, [2016-12-17T09:18:10.433879 #14]  INFO -- : 
I, [2016-12-17T09:18:10.434862 #14]  INFO -- : > su postgres -c 'psql discourse -c "grant all privileges on database discourse to discourse;"' || true
2016-12-17 09:18:10 UTC [73-1] postgres@discourse FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
I, [2016-12-17T09:18:10.580972 #14]  INFO -- : 
I, [2016-12-17T09:18:10.581433 #14]  INFO -- : > su postgres -c 'psql discourse -c "alter schema public owner to discourse;"'
2016-12-17 09:18:10 UTC [84-1] postgres@discourse FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
I, [2016-12-17T09:18:10.726359 #14]  INFO -- : 
I, [2016-12-17T09:18:10.727477 #14]  INFO -- : Terminating async processes
I, [2016-12-17T09:18:10.727748 #14]  INFO -- : Sending INT to HOME=/var/lib/postgresql USER=postgres exec chpst -u postgres:postgres:ssl-cert -U postgres:postgres:ssl-cert /usr/lib/postgresql/9.5/bin/postmaster -D /etc/postgresql/9.5/main pid: 43
2016-12-17 09:18:10 UTC [43-1] LOG:  received fast shutdown request
I, [2016-12-17T09:18:20.729327 #14]  INFO -- : HOME=/var/lib/postgresql USER=postgres exec chpst -u postgres:postgres:ssl-cert -U postgres:postgres:ssl-cert /usr/lib/postgresql/9.5/bin/postmaster -D /etc/postgresql/9.5/main pid:43 did not terminate cleanly, forcing termination!


FAILED
--------------------
Pups::ExecError: su postgres -c 'psql discourse -c "alter schema public owner to discourse;"' failed with return #<Process::Status: pid 74 exit 2>
Location of failure: /pups/lib/pups/exec_command.rb:108:in `spawn'
exec failed with the params "su postgres -c 'psql $db_name -c \"alter schema public owner to $db_user;\"'"
a00b2a37440a0d25746f80f9b2efb53e606e63216ea1a1014b2e440f9c6e3d0a
** FAILED TO BOOTSTRAP ** please scroll up and look for earlier error messages, there may be more than one

So site down. Any ideas?

Edit: Actually, site not 100% down. Came up with start app, but still at 1.6.7.
Edit 2: Wrote this post, had a cup of coffee, tried again with the rebuild and it looks like it is going to work now. Wtf.


(ljpp) #2

And it succeeded now, with no apparent reason why. Mind boggling.

Another weird thing I noticed. The system had not send digest emails in month, but now spat out emails to all users. It also seems that it had not sent notifiers to users, as I got several notifications when the site came back up. Something appears to have been frozen, without me knowning about it.The site’s basic features worked just fine.

Edit: Noticed that automated backups had also frozen a month ago. And this has happened not on one, but two of my smaller instances. Interestingly, both are hosted on ScaleWay.

Edit 2: It would nice to have a warning if automated backups, or other configured background task are not successfully completed according to scheduled interval?

I can’t remember what has happened in early November, possibly causing the freeze on two sites, but I have a faint recollection that I might done a minor upgrade via admin UI at that time. Usually I always upgrade via rebuild.


(ljpp) #3

OMG, some serious shit going on with one of the problematic sites https://cd-rw.org

I restarted the App and site came back to life. Did a manual backup, but it is several megabytes smaller than the previous backup from early November. Did a rebuild and the site lost a lot of the settings and they were replaced with unconfigured defaults: All basic site setting gone (logos, names, email, third party login keys…), ad plugin settings wiped too.

Luckily all the discussion threads seem to be in place, but it looks like I have some serious reconfiguring to do. :frowning: Not my kind of saturday…

@codinghorror Is there a way to import only the site settings from an older backup?

Edit: What the f… is going on in here. Did a stop/start app, and missing settings came back…This is crazy.


(Rafael dos Santos Silva) #4

Reading the log, your postgresql was stopped abruptly, so it needs time to recover itself. The problem is our script doesn’t wait that much for a fucked up pg to recover so you will need some fiddling.


(ljpp) #5

Yeah, I guessed something like that. Very interesting that this happened on both of my Scaleway instances, but never seen this before. Also, there might be room for improvement that before the 1st rebuild attempt there was no indication what so ever that something was wrong. Site basically worked, one could write new posts etc - only the background tasks were frozen.

Also, when I noticed that settings were gone, I started re-configuring the basic settings. I was quite bummed when the settings (the ones I was yet to reconfigure) came back.


(ljpp) #6

Spotted something interesting in the cheap hosting thread:

It is actually so slow that templates/postgres.template.yml had to be manually modified from sleep 5 to sleep 60 for postgres.

Or could @zsero share us what kind of issues he experienced before making the modification?


(Zsolt Ero) #7

Hi, yes, exactly this error.

What was happening for me was the following, and I believe the same is true in your case. I was on Scaleway VC1S.

The stop timeout is not enough for this system, after 10 seconds PostgreSQL is still running. Now with -t 10 we are sending a kill signal to PostgreSQL which results in an unclean shutdown case.
This is problem #1.

This means that a db recovery has to run on next startup.

But 5 seconds is not enough for a recovery, so a startup cannot continue. This is problem #2.

Workaround for now:

  1. Do not use ./launcher rebuild, etc. without manually shutting down the instance. Use docker stop app and wait, wait. It will take more than 10 seconds!

  2. Modify the postgresql template and change sleep 5 to sleep 60. This is enough time to have the recovery complete. Your system will work afterwards, but this will leave that file out of sync with the git repo.

Proper fix would be:

  1. Raise the docker stop timeout. I don’t have the host anymore, but could you run time docker stop app and tell us how much it takes for you on a running instance?

  2. Change the postgres template to something dynamic instead of a fixed timeout. I’ve opened this ticket for a proper solution, but will need some help/input to continue: https://github.com/discourse/discourse_docker/issues/309


The cheapest solution for hosting?
ScaleWay review?
(Jeff Atwood) #8

We should increase our timeouts here @falco to make them safer.


(Sam Saffron) #9

Sleep 60 means everyone waits 60 seconds EVERY time a rebuild happens, just make it a param and the 1 in 1000 that has issues can bump it up in container yaml


(Rafael dos Santos Silva) #10

I think @sam wants something like this:


(Sam Saffron) #11

either that or the original proposal which made it dynamic if you can swing it.

https://github.com/discourse/discourse_docker/issues/309


(Jeff Atwood) #12

Hmm I thought this was a maximum amount of time to wait for Postgres to cleanly shut down, versus an absolute number.


(Rafael dos Santos Silva) #13

looks like pg_ctl -w does all the magic for us.

@ljpp said scaleway is slow, can you repro the problem consistently? I would like to test if this PR solves running in slow hardware


(Daniel) #14

I`m experiencing similar behavior on ScaleWay. Only looks like in my case backups are more badly corrupted. Backup files are smaller than before and any restoration attempt fails.(Restoration produces errors like these)


(ljpp) #15

@Falco I can’t say consistently, as I encountered this once and the story line is there in the first posts. After reading @zsero comments about the time-outs, I just moved my small intances to Hetzner. So I don’t have anything running at Scaleway at this moment.

@MerryMinstrel Mine did recover eventually and so did the content of the backups. In the end, I do not see anything missing or lost.


(Zsolt Ero) #16

Daniel, backup files were not corrupted in my case, just the postgres database was shut down uncleanly.

I don’t think your backups are corrupted, they just need to be run on the same major version of postgres as they were created on.

Do you by any chance use the Discourse image from Scaleway’s ImageHub? In which case I strongly recommend you not to but create a clean Ubuntu 16.04 VPS and follow the official readme. Also, don’t forget to choose the Docker kernel in advanced settings.


(Daniel) #17

Well that`s not reassuring. :sweat_smile: Maybe I should start considering manual data retrieval? If it is an option.

Both of my installs are made on clean Ubuntu 16.04 VPS following official readme. Both instances are running on psql (PostgreSQL) 9.5.5. That was the first thing I checked. :confused: At ScaleWay I have chosen Docker kernel. My other install is on DigitalOcean 1GB RAM. As far as I can check everything is up to date.


(Zsolt Ero) #18

I recommend you to start a new thread, and state which is the old host, which is the new host, and share the exact errors when trying to restore. Also, is your Docker running on aufs?

docker info | grep Storage


(Daniel) #19

Thanks, I will.

Yes, it is. Storage Driver: aufs


(Rafael dos Santos Silva) #20

Just got a brand new server at scaleway, the cheapest possible, but can’t get postgresql to be so slow on startup…