root@forum:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            7.9G     0  7.9G   0% /dev
tmpfs           1.6G  7.5M  1.6G   1% /run
/dev/sda        315G  170G  129G  57% /
tmpfs           7.9G     0  7.9G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
overlay         315G  170G  129G  57% /var/lib/docker/overlay2/fdaa955ab22c1f85f463b6e3cc48e7f87426c4c0d349699d45b76425fe113f86/merged
tmpfs           1.6G     0  1.6G   0% /run/user/1000
root@forum:~# free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        2.7G        292M         10G         12G        2.3G
Swap:          255M        162M         93M
You might be able to find this file with something like
find / -name pg_upgrade\*.log
I tried that and no, it isn’t anywhere. My theory is it’s inside the temporary docker container used for the upgrade and thus removed when it finishes running.
I wonder, as you’re at the command line, if you could hit ^Z at the right moment to stop the process after it has logged the problem but before the script around it has deleted the log. Then try the find again.
Edit: and then fg to let the process run to completion, cleanly.
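Roughly like this, as a sketch (the copy path is a placeholder; use whatever the find actually reports):
./launcher rebuild app                     # the upgrade runs in the foreground
# ...press Ctrl-Z the moment the failure is printed, suspending the job...
find / -name 'pg_upgrade*.log'             # locate the log while everything is paused
cp /path/to/pg_upgrade_server.log /root/   # placeholder path: stash a copy somewhere safe
fg                                         # resume the suspended job so it finishes cleanly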
I’m really not comfortable doing that. If it messes up and I need to restore a snapshot, that would take the site down for much longer than just a ./launcher rebuild app.
So the issue is, you’re still stuck with the space limitations? There’s no way to wave a magic wand, give yourself +100 GB, then roll it back to a smaller disk later?
Not complaining about this, just trying to understand where you are so we can help.
Recently, we noticed our used disk space had jumped from 41% to 85%. That is when I found all the “orphaned” docker images; when I purged them, all the disk space came back, and it was a lot of reclaimed space.
So you may also want to check all your docker images and make sure you have deleted and purged all unused images and containers:
docker images -a
When container rebuild tasks fail, the process generally leaves behind a lot of “wasteful” images, which can be invisible to admins who are unaware of them.
We purge these images often with:
docker system prune -af
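Before pruning, you can check how much space is actually reclaimable and what the dangling leftovers are:
docker system df                  # disk used by images, containers, and volumes, with reclaimable totals
docker images -f dangling=true    # untagged image layers typically left behind by failed rebuilds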
There is also a similar wrapper command in launcher, but I prefer working directly with docker commands. You can use the launcher cleanup command as well, if you wish, but I don’t remember the exact syntax off the top of my head.
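For what it’s worth, I believe it is simply the following (running ./launcher with no arguments prints the list of supported commands, so you can confirm there):
cd /var/discourse
./launcher cleanup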
Is this best regarded as a bug in the upgrade script, if it tidies up so thoroughly that the explanatory log file is removed before we have a chance to read it? (Could the script perhaps tail -99 the logfile before removal, if it must remove?)
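Something like this inside the wrapper would do it, purely as a sketch (the variable and path names here are made up, not taken from the real script):
# hypothetical cleanup step in the upgrade wrapper
for log in "$UPGRADE_DIR"/pg_upgrade*.log; do
    [ -f "$log" ] || continue
    echo "=== tail of $log ==="
    tail -n 99 "$log"     # surface the end of each log before it is deleted
done
rm -rf "$UPGRADE_DIR"     # the existing tidy-up step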
The safest solution might be to migrate to a new server. You’d just do a new install and restore the backup; if anything goes wrong, you have zero downtime. When you think it works, you make the site read-only, restore to the new server, and change the DNS.
Yes, that’s right. We can add a separate volume temporarily but not expand the existing volume. And I mean, the manual upgrade worked from 10 -> 12; it just isn’t working from 12 -> 13 for some reason.
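If we did go the temporary-volume route, the rough idea would be to mount it where the new data directory will live (assuming, hypothetically, the volume attaches as /dev/sdb):
mkfs.ext4 /dev/sdb                                                  # format the temporary volume
mkdir -p /var/discourse/shared/standalone/postgres_data_new
mount /dev/sdb /var/discourse/shared/standalone/postgres_data_new   # new PG13 data dir lives on the temp volume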
@neounix: The prune got us 3.5GB, not enough to make a difference.
@Ed_S: I don’t know if it’s a bug but it would certainly be more robust if it put its logs outside the container so people could look at them.
@pfaffman: Yes that’s the nuclear option, would like to avoid building out a whole new host just to upgrade the database!
Hi @Wingtip, to debug failed upgrades using the space-constrained method, you can try the following:
docker run \
    -v /var/discourse/shared/standalone/postgres_data:/var/lib/postgresql/12/data \
    -v /var/discourse/shared/standalone/postgres_data_new:/var/lib/postgresql/13/data \
    tianon/postgres-upgrade:12-to-13
note that there is no --rm flag in this command, so the container is kept after it exits instead of being removed automatically.
Once the upgrade fails, the container will be in the ‘stopped’ state. You’ll need to find the container ID using
docker ps -a
root@fitzy-testing:~# docker ps -a
CONTAINER ID   IMAGE                              COMMAND                  CREATED          STATUS                      PORTS                                      NAMES
4488419cbe93   tianon/postgres-upgrade:12-to-13   "docker-upgrade pg_u…"   22 seconds ago   Exited (1) 20 seconds ago                                              busy_napier
cbb267224d23   local_discourse/app                "/sbin/boot"             3 weeks ago      Up 3 weeks                  0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp   app
See here that the container name was busy_napier; alternatively, you can refer to it by its container ID (4488419cbe93).
To access the log, commit the failed container to a new image, so we can start the container with a different entrypoint
root@fitzy-testing:~# docker commit busy_napier debug_failed_upgrade
sha256:25f28eff29fece0d75bdd804cadfaeb99db434e9d14f573702e5f36d0dcf7490
Now, you can start a new container with the filesystem intact from the failed upgrade
root@fitzy-testing:~# docker run -it --entrypoint /bin/bash debug_failed_upgrade
root@7126d8d6d0cf:/var/lib/postgresql# ls
12  13  data  pg_upgrade_internal.log  pg_upgrade_server.log  pg_upgrade_utility.log
root@7126d8d6d0cf:/var/lib/postgresql# cat pg_upgrade_server.log
-----------------------------------------------------------------
  pg_upgrade run on Sun Feb 28 23:27:30 2021
-----------------------------------------------------------------
command: "/usr/lib/postgresql/12/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/var/lib/postgresql/12/data" -o "-p 50432 -c autovacuum=off -c autovacuum_freeze_max_age=2000000000 -c listen_addresses='' -c unix_socket_permissions=0700" start >> "pg_upgrade_server.log" 2>&1
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2021-02-28 23:27:30.657 UTC FATAL:  database files are incompatible with server
2021-02-28 23:27:30.657 UTC DETAIL:  The data directory was initialized by PostgreSQL version 13, which is not compatible with this version 12.6 (Debian 12.6-1.pgdg100+1).
stopped waiting
pg_ctl: could not start server
Examine the log output.
You can see here that my upgrade failed (expectedly) because I was attempting to upgrade an instance that was already on PG13. Remember to remove the stopped container once you’ve debugged!
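For example, with the names used above:
docker rm busy_napier              # remove the stopped upgrade container
docker rmi debug_failed_upgrade    # remove the image created by docker commit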
@fitzy thanks! The log output is below. This is the entire log.
root@c70d93ad0cb9:/var/lib/postgresql# cat pg_upgrade_dump_16583.log
command: "/usr/lib/postgresql/13/bin/pg_dump" --host /var/lib/postgresql --port 50432 --username postgres --schema-only --quote-all-identifiers --binary-upgrade --format=custom --file="pg_upgrade_dump_16583.custom" 'dbname=discourse' >> "pg_upgrade_dump_16583.log" 2>&1
pg_dump: error: query returned 2 rows instead of one: SELECT c.reltoastrelid, i.indexrelid FROM pg_catalog.pg_class c LEFT JOIN pg_catalog.pg_index i ON (c.reltoastrelid = i.indrelid AND i.indisvalid) WHERE c.oid = '2782645'::pg_catalog.oid;
It appears to come from this line:
Which I think calls:
A similar function in another module has the following comment: “Catalog corruption might elicit other row counts.”
It’s possible that the query is failing because duplicate indexes exist. Have you re-indexed recently at all? Reindexing may clean up any duplicates and avoid this failure.
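As a quick way to test that hypothesis (my own sketch, not part of the upgrade procedure), this lists any tables whose TOAST table has more than one valid index, which is exactly the condition that makes the upgrade-time query return two rows; run it from a shell inside the app container:
su postgres -c 'psql discourse' <<'SQL'
SELECT c.oid, c.relname, count(i.indexrelid) AS valid_toast_indexes
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_index i
  ON c.reltoastrelid = i.indrelid AND i.indisvalid
WHERE c.reltoastrelid <> 0
GROUP BY c.oid, c.relname
HAVING count(i.indexrelid) > 1;
SQL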
No, not since the PG12 upgrade. Happy to run whatever you suggest to fix it.
@Wingtip you should be able to follow the same directions as in the OP for reindexing concurrently; this is supported in Postgres 12.
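On a standard install that usually looks something like this (a sketch; defer to the exact steps in the OP):
cd /var/discourse
./launcher enter app        # opens a shell inside the running app container
# then, inside the container:
su postgres -c 'psql discourse -c "REINDEX SCHEMA CONCURRENTLY public;"'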
Just ran it, exact same issue.
Can you run this query and paste the output here:
SELECT * FROM pg_catalog.pg_class c LEFT JOIN pg_catalog.pg_index i ON (c.reltoastrelid = i.indrelid AND i.indisvalid) WHERE c.oid = '2782645'::pg_catalog.oid;
That returns zero rows.
Also the latest rebuild messed up our forum’s formatting, even though I didn’t change anything at all, so now users are pissed off.
Hmm, well, CSS issues are usually easy to fix, so let’s stay on target here with the database.