PostgreSQL 13 update

root@forum:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            7.9G     0  7.9G   0% /dev
tmpfs           1.6G  7.5M  1.6G   1% /run
/dev/sda        315G  170G  129G  57% /
tmpfs           7.9G     0  7.9G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
overlay         315G  170G  129G  57% /var/lib/docker/overlay2/fdaa955ab22c1f85f463b6e3cc48e7f87426c4c0d349699d45b76425fe113f86/merged
tmpfs           1.6G     0  1.6G   0% /run/user/1000
root@forum:~# free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        2.7G        292M         10G         12G        2.3G
Swap:          255M        162M         93M
2 Likes

You might be able to find this file with something like

find / -name pg_upgrade\*.log

2 Likes

I tried that and no, it isn’t anywhere. My theory is it’s inside the temporary docker container used for the upgrade and thus removed when it finishes running.

4 Likes

I wonder, as you’re at the command line, if you could hit ^Z at the right moment to stop the process after it has logged the problem but before the script around it has deleted the log. Then try the find.

Edit: and then fg to let the process run to completion, cleanly.
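Roughly, assuming the rebuild is running as a foreground job in your shell, something like:

./launcher rebuild app
# ...press Ctrl-Z the moment the pg_upgrade step reports the failure
find / -name pg_upgrade\*.log
fg    # resume the suspended rebuild so it finishes cleanly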

I’m really not comfortable doing that. If it messes up and I need to restore a snapshot, that would take the site down for much longer than just a ./launcher rebuild app.

2 Likes

So the issue is, you’re still stuck with the space limitations? There’s no way to wave a magic wand, give yourself +100gb, then roll it back to a smaller disk later?

Not complaining about this, just trying to understand where you are so we can help.

2 Likes

Hey @Wingtip

Recently, we noticed our used disk space had gone from 41% to 85%. That is when I found all the “orphaned” Docker images; when I purged them, all the space came back, and it was a lot of reclaimed disk space.

Hence, you may also want to check your Docker images and make sure you have deleted and purged all unused images and containers.

Check images:

docker images -a

When container rebuild tasks fail, the process generally leaves behind a lot of “wasteful” images, which can stay hidden from admins who are unaware of them.

We purge these images often with:

docker system prune -af

There is also a similar wrapper command in launcher, but I prefer working directly with docker commands. You can use the launcher cleanup command as well, if you wish, but I don’t remember the exact syntax off the top of my head.
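If memory serves, it is just the cleanup subcommand, but do verify before relying on it:

cd /var/discourse
./launcher cleanup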

2 Likes

Is this best regarded as a bug in the upgrade script, if it tidies up so thoroughly that the explanatory log file is removed before we have a chance to read it? (Could the script perhaps tail -99 the logfile before removal, if it must remove?)
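A hypothetical sketch, with LOG_FILE standing in for whatever the script calls its pg_upgrade log:

# just before the cleanup step removes the container and its logs
if [ -f "$LOG_FILE" ]; then
  echo "--- last lines of $LOG_FILE ---"
  tail -n 99 "$LOG_FILE"
fi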

1 Like

The safest solution might be to migrate to a new server. You’d just do a fresh install and restore a backup. If anything goes wrong, you have zero downtime. When you’re confident it works, you make the site read-only, restore the latest backup to the new server, and change the DNS.
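As a rough sketch of the restore step on the new host (this assumes a standard /var/discourse install; the backup filename is a placeholder):

cd /var/discourse
./launcher enter app
discourse enable_restore
discourse restore <backup-filename>.tar.gz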

1 Like

Yes, that’s right. We can add a separate volume temporarily but not expand the existing volume. And the manual upgrade path worked from 10 -> 12; it just isn’t working from 12 -> 13 for some reason.

@neounix: The prune got us 3.5GB, not enough to make a difference.

@Ed_S: I don’t know if it’s a bug but it would certainly be more robust if it put its logs outside the container so people could look at them.

@pfaffman: Yes that’s the nuclear option, would like to avoid building out a whole new host just to upgrade the database!

4 Likes

Hi @Wingtip, to debug failed upgrades using the space-constrained method, you can try the following:

docker run \
	-v /var/discourse/shared/standalone/postgres_data:/var/lib/postgresql/12/data \
	-v /var/discourse/shared/standalone/postgres_data_new:/var/lib/postgresql/13/data \
	tianon/postgres-upgrade:12-to-13

Note that there is no --rm flag.

Once the upgrade fails, the container will be in the ‘stopped’ state. You’ll need to find the container ID using docker ps -a:

root@fitzy-testing:~# docker ps -a
CONTAINER ID   IMAGE                              COMMAND                  CREATED          STATUS                      PORTS                                      NAMES
4488419cbe93   tianon/postgres-upgrade:12-to-13   "docker-upgrade pg_u…"   22 seconds ago   Exited (1) 20 seconds ago                                              busy_napier
cbb267224d23   local_discourse/app                "/sbin/boot"             3 weeks ago      Up 3 weeks                  0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp   app

Here the container name was busy_napier, or alternatively the ID 4488419cbe93.

To access the log, commit the failed container to a new image so that we can start a container with a different entrypoint:

root@fitzy-testing:~# docker commit busy_napier debug_failed_upgrade
sha256:25f28eff29fece0d75bdd804cadfaeb99db434e9d14f573702e5f36d0dcf7490

Now you can start a new container with the filesystem from the failed upgrade intact:

root@fitzy-testing:~# docker run -it --entrypoint /bin/bash  debug_failed_upgrade
root@7126d8d6d0cf:/var/lib/postgresql# ls
12  13  data  pg_upgrade_internal.log  pg_upgrade_server.log  pg_upgrade_utility.log
root@7126d8d6d0cf:/var/lib/postgresql# cat pg_upgrade_server.log

-----------------------------------------------------------------
  pg_upgrade run on Sun Feb 28 23:27:30 2021
-----------------------------------------------------------------

command: "/usr/lib/postgresql/12/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/var/lib/postgresql/12/data" -o "-p 50432 -c autovacuum=off -c autovacuum_freeze_max_age=2000000000  -c listen_addresses='' -c unix_socket_permissions=0700" start >> "pg_upgrade_server.log" 2>&1
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2021-02-28 23:27:30.657 UTC [37] FATAL:  database files are incompatible with server
2021-02-28 23:27:30.657 UTC [37] DETAIL:  The data directory was initialized by PostgreSQL version 13, which is not compatible with this version 12.6 (Debian 12.6-1.pgdg100+1).
 stopped waiting
pg_ctl: could not start server
Examine the log output.

You can see here that my upgrade failed (expectedly) because I was attempting to upgrade an instance that was already on PG13. Remember to remove the stopped container once you’ve debugged!
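For example, using the names from the session above (the committed debug image is worth removing too):

docker rm busy_napier
docker rmi debug_failed_upgrade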

6 Likes

@fitzy thanks! The log output is below. This is the entire log.

root@c70d93ad0cb9:/var/lib/postgresql# cat pg_upgrade_dump_16583.log
command: "/usr/lib/postgresql/13/bin/pg_dump" --host /var/lib/postgresql --port 50432 --username postgres --schema-only --quote-all-identifiers --binary-upgrade --format=custom  --file="pg_upgrade_dump_16583.custom" 'dbname=discourse' >> "pg_upgrade_dump_16583.log" 2>&1
pg_dump: error: query returned 2 rows instead of one: SELECT c.reltoastrelid, i.indexrelid FROM pg_catalog.pg_class c LEFT JOIN pg_catalog.pg_index i ON (c.reltoastrelid = i.indrelid AND i.indisvalid) WHERE c.oid = '2782645'::pg_catalog.oid;
4 Likes

It appears to come from the check in pg_dump around the query shown in the log above, which I think calls the helper that errors out unless the query returns exactly one row. A similar function in another module has the following comment:

Catalog corruption might elicit other row counts.

5 Likes

It’s possible that query is failing because duplicate indexes exist. Have you re-indexed recently at all? Re-indexing may clean up any duplicates and avoid this failure.

6 Likes

No, not since the PG12 upgrade. Happy to run whatever you suggest to fix it.

2 Likes

@Wingtip you should be able to follow the same directions as in the OP for reindexing concurrently; this is supported in Postgres 12.
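In short, that means running REINDEX with the CONCURRENTLY option inside the app container, along these lines (this assumes the database is named discourse):

./launcher enter app
su postgres -c 'psql discourse -c "REINDEX SCHEMA CONCURRENTLY public;"'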

3 Likes

Just ran it, exact same issue.

1 Like

Can you run this query and paste the output here:

SELECT *
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_index i 
    ON (c.reltoastrelid = i.indrelid AND i.indisvalid)
WHERE c.oid = '2782645'::pg_catalog.oid;
1 Like

That returns zero rows.

Also the latest rebuild messed up our forum’s formatting, even though I didn’t change anything at all, so now users are pissed off.

2 Likes

Hmm, well, CSS issues are usually easy to fix, so let’s stay on target here :dart: with the database.

1 Like