Corrupt indexes in PG12, how do I fix?

I’m just speculating but here is my tinfoil theory:

I ran Reindex concurrently twice (so first ccnew, ccnew1)
and both the times it presented errors.

maybe when it fails, it just doesn’t revert, it just leaves the garbage and quits.

8 Likes

This sounds very very very likely.

4 Likes

Just adding my $0.02 here,

Toast indexes can’t be dropped this way,
I had to:

su postgres
psql 
\connect discourse
drop index pg_toast.pg_toast_309322_index_ccnew;
drop index pg_toast.pg_toast_309322_index_ccnew1;

The above just applies to pg toast because discourse user does not have access to that index.
PG::InsufficientPrivilege: ERROR: permission denied for schema pg_toast

5 Likes

dropping toast indexes is super scary. I guess you have no choice here though. Yeah you need to do that from psql.

5 Likes

and here comes the happy reindex:

discourse=# REINDEX SCHEMA CONCURRENTLY public;
REINDEX

:partying_face:

6 Likes

Great catch! I found all indices with “ccnew” in the name via this query.

psql
\connect discourse
select tablename,indexname,indexdef from pg_indexes where indexname LIKE '%ccnew%';

incoming_referer.csv (7.2 KB)

Turned out I had a whopping 30 of them, all on the incoming_referers table. So I verified all the ccnew indices were actually duplicates via the indexdef column in this query.

select indexname,indexdef from pg_indexes where tablename = 'incoming_referers';

ccnew.csv (6.6 KB)

And then dropped all of them successfully.

DROP INDEX incoming_referers_pkey_ccnew;
DROP INDEX incoming_referers_pkey_ccnew_ccnew;
DROP INDEX incoming_referers_pkey_ccnew1;
...and so on for all 30

At that point I reindexed the whole schema again and it again was unable to rebuild two of the same incoming_referers ccnew indices, and also found three pg_toast indices. I dropped them and then reindexed the whole schema yet again, again more errors, found a bunch more ccnew indices in the discourse schema, reindexed a third time…

I can’t get a full reindex to complete without errors, it keeps creating and then failing to rebuild new ccnew indexes every time. After 4 full rebuilds I deleted the ccnew indexes then gave up on it. I guess I could try to rebuild non-concurrently but that would cause a bunch of downtime.

Anyway, my guess is most users upgrading from PG10 to 12 who tried to reindex afterwards have these extra ccnew indexes and they should all be deleted. They’ll just take up space and multiply disk write I/O for no benefit.

6 Likes

I ran it’s equivalent in Data-Explorer. was slightly easier to manage.

What I did was that I took all the ccnew/ccnew1/ccnew2…ccnewn indexes, nuked them and reindexed. That fixed it for me.

ccnew is a postgres thing of marking the indexes and I think it is some inefficiency in the process that causes these to be left if an indexing fails for some reason.

I can only suggest you first work upon the root cause of the problem. I had duplicated tags that I had to delete before proceeding to nuking indexes. if I was left with even a single duplicated tag, it will not index properly and create another ccnew'n' index and fail.

2 Likes

You’re right, I still have duplicates-- can’t rebuild the non-ccnew indexes non-concurrently. I need to remove these duplicate rows.

discourse=# reindex index index_incoming_referers_on_path_and_incoming_domain_id;
ERROR:  could not create unique index "index_incoming_referers_on_path_and_incoming_domain_id"
DETAIL:  Key (path, incoming_domain_id)=(/search/, 3433) is duplicated.
discourse=# reindex index index_incoming_referers_on_path_and_incoming_domain_id;
ERROR:  could not create unique index "index_incoming_referers_on_path_and_incoming_domain_id"
DETAIL:  Key (path, incoming_domain_id)=(/search/, 1861) is duplicated.

What’s really weird is I only see one row with each of those incoming_domain_id values in incoming_referer. So why are they duplicates?

discourse=# select * from incoming_referers where path='/search/' AND incoming_domain_id IN (1861,3433);
  id   |   path   | incoming_domain_id 
-------+----------+--------------------
 42845 | /search/ |               1861
 40763 | /search/ |               3433
(2 rows)

@sam or @riking should I delete these two rows as follows:

DELETE FROM incoming_referers WHERE path='/search/' AND incoming_domain_id IN (1861,3433);

… I guess I’m learning postgres after all, heh.

1 Like

No, that will delete both of them - you want to use the WHERE to find the two ids, and pick just one of the ids to delete.

Your query is using the corrupt index, which is why you’re only seeing one row for each. Try this:

... where path LIKE '%/search/' ...
3 Likes

OK, that brings up 43 rows.

discourse=# select * from incoming_referers where path LIKE '%/search/' ORDER BY incoming_domain_id;
  id   |    path     | incoming_domain_id 
-------+-------------+--------------------
   878 | /search/    |                 63
 33457 | /do/search/ |                567
  1580 | /search/    |                602
  1888 | /search/    |                663
 42983 | /search/    |               1259
  4896 | /search/    |               1788
 42845 | /search/    |               1861
  5162 | /search/    |               1861
  5176 | /search/    |               1866
 43350 | /search/    |               1905
 17238 | /search/    |               1905
 20689 | /search/    |               1982
  5781 | /hg/search/ |               1987
  8031 | /search/    |               2665
 10325 | /search/    |               3192
 11289 | /search/    |               3414
 40763 | /search/    |               3433
 42849 | /search/    |               3433
 13087 | /search/    |               3895
 13159 | /search/    |               3949
 13802 | /do/search/ |               4051
 14407 | /search/    |               4209
 14507 | /search/    |               4211
 15394 | /search/    |               4230
 15533 | /search/    |               4258
 45274 | /search/    |               5303
 20923 | /search/    |               5400
 21317 | /search/    |               5534
 22928 | /search/    |               5918
 22956 | /search/    |               5926
 37448 | /search/    |               6393
 25094 | /search/    |               6412
 25594 | /search/    |               6547
 39655 | /search/    |               6596
 27371 | /search/    |               6986
 27452 | /a/search/  |               7003
 27623 | /search/    |               7041
 31041 | /search/    |               7767
 36943 | /search/    |               8622
 37381 | /search/    |               8711
 37411 | /search/    |               8716
 40424 | /search/    |               9124
 44451 | /search/    |               9525
(43 rows)

So should I run the following?

DELETE FROM incoming_referers WHERE path LIKE '%/search/' AND id IN (42845,43350,42849);
2 Likes

Yep, that DELETE looks correct and should allow you to take a clean backup.

2 Likes

Believe it or not, more failures. First set, I found another duplicate key on /m/search which wasn’t caught by the %/search/ wildcard for some reason and deleted that. Reindexed again (each time taking nearly an hour!) and more failures showing a duplicate key on users(username_lower).

discourse=# reindex index index_users_on_username_lower;              
ERROR:  could not create unique index "index_users_on_username_lower"
DETAIL:  Key (username_lower)=(john_smith) is duplicated.

But the thing is, there was only one row with username_lower=john_smith! Time for the detective hat.

Looking at the forum admin UI, we had two separate users respectively named “john_smith” and “John_Smith”-- note the different capitalization. So I nuked the all-lowercase one at the forum level as he hadn’t been active in 4 years and then that index rebuilt OK.

Ran yet another full rebuild, each one taking nearly an hour, and FINALLY I think we’re good-- only one error, but no duplicates, just a pg_toast ccnew. I dropped it.

discourse=# REINDEX SCHEMA CONCURRENTLY public;               
WARNING:  cannot reindex invalid index "pg_toast.pg_toast_2783329_index_ccnew" concurrently, skipping
REINDEX

It’s been a journey.

4 Likes

Can/will these database fixes be automated in a future discousre upgrade?

1 Like

More likely pg index corruption will stop in a future pg upgrade, if Peter gets a consistent repro

6 Likes

If a concurrent REINDEX fails because the table has invalid data, you need to:

  1. Fix the invalid data, like it was done here.

  2. List the invalid indexes using SELECT pg_class.relname FROM pg_class, pg_index WHERE pg_index.indisvalid = false AND pg_index.indexrelid = pg_class.oid;

  3. Drop each invalid index listed above using DROP INDEX <indexname>;

  4. Try the REINDEX again.

7 Likes

After running the upgrade to 2.5.0beta5 and following the post-update guidance to reindex the database, I get this:

discourse=# REINDEX SCHEMA CONCURRENTLY public;
ERROR:  could not create unique index "index_plugin_store_rows_on_plugin_name_and_key_ccnew"
DETAIL:  Key (plugin_name, key)=(discourse-data-explorer, q:-10) is duplicated.

I’d rather not experiment with this one… so, how do I safely delete the duplicate?

2 Likes

No. The upgrade doesn’t corrupt your index, it just points out that your index is corrupt. You can check by trying to restore your backup to another server (even one that’s running pg10). Or try rebuilding your index on your existing install. It’s not clear what’s causing the corrupt indexes, but there is hope that pg12 will be less likely to have it happen.

There are some performance benefits of the upgrade, but holding off on it is not a bad idea

6 Likes

Too late now, so I’m still looking for guidance re. next steps.

1 Like

Something like

select id, plugin_name, key from plugin_store_rows where plugin_name like '%discourse-data-explorer%'")

should give you the rows. I think it’s safe enough to delete them.

3 Likes

I think ilike would be needed

3 Likes