TL;DR: We messed up an upgrade and are looking for help
With Home Assistant we use Discourse to power our community. We run it with the discourse_docker setup on an EC2 instance on AWS.
As an open source project, we let forum maintenance fall through the cracks and ended up on an old version, last updated at the beginning of 2019.
To make it worse, during a prior upgrade we had pinned Postgres to 9.5 because we didn’t have the disk space to upgrade to Postgres 10. We never resolved that issue.
We had also once made a change to the cloudflare template and committed it to the repo, which prevented our discourse_docker checkout from updating itself to the latest version.
Yesterday we decided to go for the upgrade…
While the database was being migrated, we ran into an issue: the migration used syntax that is not compatible with Postgres 9.5:
== 20200429095034 AddTopicThumbnailInformation: migrating =====================
-- execute("ALTER TABLE posts\nADD COLUMN IF NOT EXISTS image_upload_id bigint\n")
We realized pretty quickly that having 9.5 pinned was the issue, so we decided to migrate to Postgres 10. That didn’t work and we got this error:
I, [2020-06-12T00:30:55.448351 #1] INFO -- : Upgrading PostgreSQL from version 9.5 to 10
WARNING: Upgrading PostgresSQL would require an addtional 89M of disk space
Please free up some space, or expand your disk, before continuing.
We had 47G available, so that was weird. We then realized that discourse_docker itself was out of date, so we updated it to the latest version. Surprise: Postgres 12 had just landed.
After running the rebuild again, this time we got this error:
I, [2020-06-12T00:41:17.378129 #1] INFO -- : Upgrading PostgreSQL from version 9.5 to 12
WARNING: Upgrading PostgresSQL would require an addtional 92G of disk space
Please free up some space, or expand your disk, before continuing.
That’s a little more space, but sure. Let’s just bump our disk space to 300G and run it again.
This time pg_upgrade broke during the migration:
Restoring database schemas in the new cluster
template1
discourse
*failure* Consult the last few lines of "pg_upgrade_dump_16384.log" for the probable cause of the failure. Failure, exiting
When we looked at the pg_upgrade_dump_16384.log file, we saw the following error:
pg_restore: creating VIEW "postgres_exporter.pg_stat_activity"
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 721; 1259 678554 VIEW pg_stat_activity postgres
pg_restore: [archiver (db)] could not execute query: ERROR: column pg_stat_activity.waiting does not exist
LINE 27: "pg_stat_activity"."waiting",
^
Command was:
-- For binary upgrade, must preserve pg_type oid
SELECT pg_catalog.binary_upgrade_set_next_pg_type_oid('678556'::pg_catalog.oid);
-- For binary upgrade, must preserve pg_type array oid
SELECT pg_catalog.binary_upgrade_set_next_array_pg_type_oid('678555'::pg_catalog.oid);
-- For binary upgrade, must preserve pg_class oids
SELECT pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('678554'::pg_catalog.oid);
CREATE VIEW "postgres_exporter"."pg_stat_activity" AS
SELECT "pg_stat_activity"."datid",
"pg_stat_activity"."datname",
"pg_stat_activity"."pid",
"pg_stat_activity"."usesysid",
"pg_stat_activity"."usename",
"pg_stat_activity"."application_name",
"pg_stat_activity"."client_addr",
"pg_stat_activity"."client_hostname",
"pg_stat_activity"."client_port",
"pg_stat_activity"."backend_start",
"pg_stat_activity"."xact_start",
"pg_stat_activity"."query_start",
"pg_stat_activity"."state_change",
"pg_stat_activity"."waiting",
"pg_stat_activity"."state",
"pg_stat_activity"."backend_xid",
"pg_stat_activity"."backend_xmin",
"pg_stat_activity"."query"
FROM "pg_stat_activity";
Oh snap.
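As far as we can tell, the waiting column was removed from pg_stat_activity in PostgreSQL 9.6 (replaced by wait_event and wait_event_type), so the postgres_exporter view stored in our 9.5 cluster can’t be recreated in the new cluster. A guess at a way past this (untested on our side, and assuming nothing else depends on that schema) would be to drop the leftover view in the old cluster before retrying pg_upgrade:
-- Run against the old 9.5 cluster before retrying pg_upgrade.
-- Assumes nothing else still depends on the postgres_exporter schema.
DROP VIEW IF EXISTS postgres_exporter.pg_stat_activity;
-- or, more drastically, drop everything the exporter created:
-- DROP SCHEMA IF EXISTS postgres_exporter CASCADE;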
So this is where we decided to take a couple of steps back. Could we just get the forums up and running again and put them in read-only mode while we figure out this backup business? We managed to do this by fixing some permission issues for both postgres and redis, and the forums came back online on the old version. Not everything works, e.g. going to admin -> user -> groups gets us this error:
NoMethodError (undefined method `automatic_membership_retroactive' for #<Group:0x00007fcaca3045e8>)
/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/activemodel-6.0.1/lib/active_model/attribute_methods.rb:431:in `method_missing'
But the rest seems to be working.
At this point we decided that, since we had to chown our way back to a working instance, we should just start a fresh instance and import our backup.
So we started a new EC2 instance, ran through the discourse_docker getting started instructions, and started our import. Then we ran into a weird issue: it could not create an index because the data did not meet the index’s uniqueness requirement:
ERROR: could not create unique index "index_incoming_domains_on_name_and_https_and_port"
DETAIL: Key (name, https, port)=(homeassistant.home, f, 8123) is duplicated.
EXCEPTION: psql failed: DETAIL: Key (name, https, port)=(homeassistant.home, f, 8123) is duplicated.
/var/www/discourse/lib/backup_restore/database_restorer.rb:95:in `restore_dump'
But when we jumped into the Rails console of our running instance, it wasn’t a duplicate:
[7] pry(main)> IncomingDomain.where(name: "homeassistant.home")
=> [#<IncomingDomain:0x000055e5cabc3760 id: 8648, name: "homeassistant.home", https: false, port: 8123>]
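If the dump itself really does contain two identical rows, one idea (untested, and assuming we could get a psql shell into the database being restored before the index step runs) would be to find and collapse the duplicates by hand, something like:
-- Find the offending duplicate groups in incoming_domains.
SELECT name, https, port, COUNT(*)
FROM incoming_domains
GROUP BY name, https, port
HAVING COUNT(*) > 1;
-- Keep the row with the lowest id in each group and delete the rest.
DELETE FROM incoming_domains a
USING incoming_domains b
WHERE a.id > b.id
  AND a.name = b.name
  AND a.https = b.https
  AND a.port = b.port;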
So that’s where we are right now. And we’re kinda lost.
- We have a running instance whose DB is out of sync with the Ruby code and that we are unable to migrate to a newer Postgres
- We have a backup that cannot be imported in a new instance
We explored moving to a paid hosted Discourse, but with 3 million pageviews and a million posts, the enterprise pricing is too big a commitment for us.
So we need to find a way out. Preferably we’d import our backup into a fresh instance, but getting our old instance migrated would work too.
Anyone have any ideas? We don’t mind paying someone to help us out either.