Import failing with `could not create unique index`

TL;DR: We messed up an upgrade and are looking for help


At Home Assistant we use Discourse to power our community. We run it with the discourse_docker setup on an EC2 instance on AWS.

As an open source project, we let forum maintenance fall through the cracks and ended up on an old version, last updated at the beginning of 2019.

To make it worse, during a prior upgrade we had pinned Postgres to 9.5 because we didn’t have the necessary disk space to upgrade to Postgres 10. We never resolved that issue.

We had also once made a change to the Cloudflare template and committed that to the repo, which prevented our discourse_docker checkout from updating itself to the latest version.

Yesterday we decided to go for the upgrade…

When the database was being migrated, we ran into an issue: a migration used syntax that is not compatible with 9.5:

== 20200429095034 AddTopicThumbnailInformation: migrating =====================
-- execute("ALTER TABLE posts\nADD COLUMN IF NOT EXISTS image_upload_id bigint\n")
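
(Side note: ALTER TABLE ... ADD COLUMN IF NOT EXISTS only exists in PostgreSQL 9.6 and later. On 9.5 the same migration would have needed a guard along these lines; we never went down that road, though, since upgrading Postgres was the real fix.)

-- hypothetical 9.5-compatible equivalent of the failing statement
DO $$
BEGIN
  IF NOT EXISTS (
    SELECT 1
    FROM information_schema.columns
    WHERE table_name = 'posts'
      AND column_name = 'image_upload_id'
  ) THEN
    ALTER TABLE posts ADD COLUMN image_upload_id bigint;
  END IF;
END
$$;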

We realized pretty fast that having 9.5 pinned was the issue, so we decided to migrate to Postgres 10. That didn’t work and we got this error:

I, [2020-06-12T00:30:55.448351 #1]  INFO -- : Upgrading PostgreSQL from version 9.5 to 10
WARNING: Upgrading PostgresSQL would require an addtional 89M of disk space
Please free up some space, or expand your disk, before continuing.

We had 47G available, so that was weird. We then realized that discourse_docker was out of date, so we updated to the latest version. Surprise: Postgres 12 had just landed.

After running the rebuild again, we got this error:

I, [2020-06-12T00:41:17.378129 #1]  INFO -- : Upgrading PostgreSQL from version 9.5 to 12
WARNING: Upgrading PostgresSQL would require an addtional 92G of disk space
Please free up some space, or expand your disk, before continuing.

That’s a little more space, but sure. Let’s just bump our disk to 300G and run it again.

This time pg_upgrade broke during the migration:

Restoring database schemas in the new cluster
  template1
  discourse

*failure* Consult the last few lines of "pg_upgrade_dump_16384.log" for the probable cause of the failure. Failure, exiting

When we looked at the pg_upgrade_dump_16384.log file, we saw the following error:

pg_restore: creating VIEW "postgres_exporter.pg_stat_activity"
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 721; 1259 678554 VIEW pg_stat_activity postgres
pg_restore: [archiver (db)] could not execute query: ERROR:  column pg_stat_activity.waiting does not exist
LINE 27:     "pg_stat_activity"."waiting",
             ^
    Command was:
-- For binary upgrade, must preserve pg_type oid
SELECT pg_catalog.binary_upgrade_set_next_pg_type_oid('678556'::pg_catalog.oid);


-- For binary upgrade, must preserve pg_type array oid
SELECT pg_catalog.binary_upgrade_set_next_array_pg_type_oid('678555'::pg_catalog.oid);


-- For binary upgrade, must preserve pg_class oids
SELECT pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('678554'::pg_catalog.oid);

CREATE VIEW "postgres_exporter"."pg_stat_activity" AS
 SELECT "pg_stat_activity"."datid",
    "pg_stat_activity"."datname",
    "pg_stat_activity"."pid",
    "pg_stat_activity"."usesysid",
    "pg_stat_activity"."usename",
    "pg_stat_activity"."application_name",
    "pg_stat_activity"."client_addr",
    "pg_stat_activity"."client_hostname",
    "pg_stat_activity"."client_port",
    "pg_stat_activity"."backend_start",
    "pg_stat_activity"."xact_start",
    "pg_stat_activity"."query_start",
    "pg_stat_activity"."state_change",
    "pg_stat_activity"."waiting",
    "pg_stat_activity"."state",
    "pg_stat_activity"."backend_xid",
    "pg_stat_activity"."backend_xmin",
    "pg_stat_activity"."query"
   FROM "pg_stat_activity";

Oh snap.
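
That view references pg_stat_activity.waiting, a column that was removed in PostgreSQL 9.6 in favour of wait_event / wait_event_type, so pg_restore can’t recreate it on 12. In hindsight, dropping the postgres_exporter schema, presumably a leftover from an old monitoring exporter, on the old cluster before running pg_upgrade would probably have gotten us past this:

-- only if nothing still depends on the exporter views
DROP SCHEMA postgres_exporter CASCADE;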

So this is where we decided to take a couple of steps back: could we just get the forums up and running again and put them in read-only mode while we figured out this backup business? We managed to do that by fixing some permission issues for both Postgres and Redis, and the forums came back online on the old version. Not everything works, e.g. going to admin -> user -> groups gives us this error:

NoMethodError (undefined method `automatic_membership_retroactive' for #<Group:0x00007fcaca3045e8>)
/var/www/discourse/vendor/bundle/ruby/2.6.0/gems/activemodel-6.0.1/lib/active_model/attribute_methods.rb:431:in `method_missing'

But the rest seems to be working.

At this point we decided that, since we had to chown our way back to a working instance, we should just start a new instance and import our backup.

So we started a new EC2 instance, followed the discourse_docker getting-started instructions and started our import. Then we ran into a weird issue: it could not create an index because the data did not satisfy the uniqueness requirement of the index:

ERROR:  could not create unique index "index_incoming_domains_on_name_and_https_and_port"
DETAIL:  Key (name, https, port)=(homeassistant.home, f, 8123) is duplicated.
EXCEPTION: psql failed: DETAIL:  Key (name, https, port)=(homeassistant.home, f, 8123) is duplicated.
/var/www/discourse/lib/backup_restore/database_restorer.rb:95:in `restore_dump'

But when we jumped into the Rails console of our running instance, it wasn’t a duplicate:

[7] pry(main)> IncomingDomain.where(name: "homeassistant.home")
=> [#<IncomingDomain:0x000055e5cabc3760 id: 8648, name: "homeassistant.home", https: false, port: 8123>]

So that’s where we are right now. And we’re kinda lost.

  • We have a running instance whose DB is out of sync with the Ruby code and that is unable to migrate to a newer Postgres
  • We have a backup that cannot be imported into a new instance

We explored moving to a paid hosted Discourse, but with 3 million pageviews and a million posts, the enterprise pricing is too big a commitment for us.

So we need to find a way out. Preferably we’d import our backup, but getting our old instance migrated would work too.

Does anyone have any ideas? We don’t mind paying someone to help us out either.

I believe the easiest path forward for you is to get a working backup and import it into a new instance, like you tried last.

Let’s try to fix the duplicated data:

# ssh into the machine
cd /var/discourse
./launcher enter app
su postgres
psql
\connect discourse
SELECT * FROM incoming_domains WHERE name LIKE '%homeassistant.home%';

# this should print multiple lines
# use SQL DELETE statements to fix it
# and exit with \q
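
If that confirms the duplicates, the cleanup could look something like this (a sketch only; double-check which rows you want to keep before deleting anything):

-- list every duplicated (name, https, port) combination,
-- not just the one the restore happened to trip over
SELECT name, https, port, count(*), array_agg(id) AS ids
FROM incoming_domains
GROUP BY name, https, port
HAVING count(*) > 1;

-- then delete the extra rows, keeping the lowest id of each group
-- (if other tables still reference the deleted ids, re-point or remove those rows first)
DELETE FROM incoming_domains d
USING incoming_domains keep
WHERE keep.name = d.name
  AND keep.https = d.https
  AND keep.port = d.port
  AND keep.id < d.id;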

Can you try the above and ask for more help if you get stuck?

Would I need to clean up IncomingLink and IncomingReferrer too, since IncomingReferrer points at IncomingDomain and IncomingLink points at IncomingReferrer?

I’m doing the query now and will try to import another backup. Querying via Postgres actually gives me different results than using Rails, but I guess that might be because of a default scope?
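
Something like this should show whether any referer rows still point at the duplicate domain rows before I delete them (the incoming_referers table and its incoming_domain_id column are my reading of the schema, so double-check the names):

-- referers attached to any of the duplicated domain rows
SELECT r.*
FROM incoming_referers r
JOIN incoming_domains d ON d.id = r.incoming_domain_id
WHERE d.name = 'homeassistant.home'
ORDER BY r.incoming_domain_id;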

We attempted to import another backup and it failed due to another corrupted index. We’ve since reindexed all unique indexes on the original instance and found that a handful of users are affected by the same issue.

Will keep you posted.

Alright, we made it out on the other side and we’re back online. Thanks for the hints, @Falco.

To help other people with similar problems, here’s a breakdown of the things we did.

So we had a couple of corrupt indexes that caused the import to fail. We were able to fix them by manually deleting the duplicates. We also had 8 users with a duplicate username_lower (too many Mikes and Marcos); we renamed those by updating both username and username_lower. From the user data we could tell that the first corruption happened in December 2019.
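
Finding the affected users looked roughly like this (a sketch; the id and replacement name in the UPDATE are made up for illustration):

-- users sharing the same lowercased username
SELECT username_lower, count(*), array_agg(id) AS ids
FROM users
GROUP BY username_lower
HAVING count(*) > 1;

-- then rename one of each pair, keeping username and username_lower in sync
UPDATE users
SET username = 'mike2', username_lower = 'mike2'
WHERE id = 12345;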

Instead of going through the "make backup" -> "restore backup" -> "fail on duplicate" -> "fix" cycle over and over, we decided to reindex all indexes up front. We found all indexes with unique constraints with the following query:

select idx.relname as index_name, 
       insp.nspname as index_schema,
       tbl.relname as table_name,
       tnsp.nspname as table_schema
from pg_index pgi
  join pg_class idx on idx.oid = pgi.indexrelid
  join pg_namespace insp on insp.oid = idx.relnamespace
  join pg_class tbl on tbl.oid = pgi.indrelid
  join pg_namespace tnsp on tnsp.oid = tbl.relnamespace
where pgi.indisunique --<< only unique indexes
  and tnsp.nspname = 'public';
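
We then pointed REINDEX at each of those, roughly like this (a sketch; \gexec needs psql 9.6+, otherwise just run the generated statements by hand):

select format('REINDEX INDEX %I.%I;', insp.nspname, idx.relname)
from pg_index pgi
  join pg_class idx on idx.oid = pgi.indexrelid
  join pg_namespace insp on insp.oid = idx.relnamespace
  join pg_class tbl on tbl.oid = pgi.indrelid
  join pg_namespace tnsp on tnsp.oid = tbl.relnamespace
where pgi.indisunique
  and tnsp.nspname = 'public'
\gexec

Every REINDEX that fails with "could not create unique index" points at a table that still has duplicates to clean up.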

Once all indexes could be rebuilt, we were able to make a backup and import it correctly into the new instance. Migrations ran as expected, we swapped instances and we were up and running :+1: Cheers to the resilience of Discourse :beers:

Thanks again @Falco.

Have a good weekend :slight_smile:

One last tip for people debugging data corruption issues. Initially, when our import failed on duplicate data, I jumped into the Rails console and searched for the data that had caused the index creation to fail.

However, because the query used the indexed fields, Postgres used the broken index to generate the results! So my initial query showed 1 result, and after deleting that entry it showed 0 results.

Queries that do full table scans for the win :slight_smile:
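
If you’d rather stay in psql, you can get the same effect by telling the planner not to use indexes for the session, e.g.:

-- force a sequential scan so the broken index is never consulted
SET enable_indexscan = off;
SET enable_bitmapscan = off;

SELECT * FROM incoming_domains WHERE name = 'homeassistant.home';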

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.