Restore failing: possible Data Explorer issue

I have a working forum and want to see some things that got messed up a few days ago. I’m on AWS, so I make an AMI of the working forum, launch a new instance, and try to restore a backup from a few days ago. It fails with the messages below.

It can’t be a version or schema mismatch because the server is built from a fresh image of the working forum.

I’ve tried rebuilding.

I’ve tried restoring from a different backup from just one day ago; same thing.

The only odd thing I’ve done is delete PDF files from the upload directory (…/uploads/original/1X/*.pdf) to free up some space. I’ll try again without that step, but it seems unlikely to be the culprit.

> [2019-11-30 01:17:44] 'admin' has started the restore!
> [2019-11-30 01:17:44] Marking restore as running...
> [2019-11-30 01:17:44] Making sure /var/www/discourse/tmp/restores/default/2019-11-30-011744 exists...
> [2019-11-30 01:17:44] Downloading archive to tmp directory...
> [2019-11-30 01:23:24] Unzipping archive, this may take a while...
> [2019-11-30 01:27:52] No metadata file to extract.
> [2019-11-30 01:27:52] Validating metadata...
> [2019-11-30 01:27:52]   Current version: 20191129144706
> [2019-11-30 01:27:52]   Restored version: 20191120015344
> [2019-11-30 01:27:52] Extracting dump file...
> [2019-11-30 01:50:57] invalid command \N
> [2019-11-30 01:50:57] invalid command \N
> 
> < repeats about 100 times >
> 
> [2019-11-30 01:51:07] invalid command \N
> [2019-11-30 01:54:13] invalid command \N
> [2019-11-30 01:54:13] EXCEPTION: psql failed
> [2019-11-30 01:54:14] /var/www/discourse/lib/backup_restore/restorer.rb:331:in `restore_dump'
> /var/www/discourse/lib/backup_restore/restorer.rb:75:in `run'
> /var/www/discourse/lib/backup_restore.rb:166:in `block in start!'
> /var/www/discourse/lib/backup_restore.rb:163:in `fork'
> /var/www/discourse/lib/backup_restore.rb:163:in `start!'
> /var/www/discourse/lib/backup_restore.rb:22:in `restore!'
> /var/www/discourse/app/controllers/admin/backups_controller.rb:119:in `restore'
> etc...

Nevertheless, the invalid command \N error is typical for a Postgres version mismatch…

root@example:/var/www/discourse# psql --version
psql (PostgreSQL) 10.10 (Debian 10.10-1.pgdg100+1)

It’s the same version on the new server and working forum.

I, too, associate those errors with a postgres version mismatch, but I saw those \N errors the other day on a system that had run out of disk space (I was restoring to the same system that made the backup). I didn’t finish diagnosing it, because it was part of another bizarre problem I was having, and restoring the backup to another server solved that problem; I wondered whether restoring to the same server would have worked as well.

You mentioned that you were low on space. I suspect that’s the issue. A restore uses a lot of space: it uncompresses the backup, so it temporarily holds two full copies of the archive, plus the space needed to load the data and to be able to revert if the restore fails.
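As a rough lower bound on the space the restored data itself will need, you can check the size of the live database (just a sketch; pg_size_pretty and pg_database_size are standard PostgreSQL functions, and discourse is the default database name in a standard install):

-- Report the on-disk size of the live database; the restore needs roughly this
-- much again, on top of the compressed and uncompressed copies of the archive.
SELECT pg_size_pretty(pg_database_size('discourse')) AS database_size;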


So it takes 23 minutes before it fails?


It gets worse, but maybe closer to the true problem… On the hypothesis that I needed more disk space, I created a new instance from my image, this time with 100 GB compared to 50 GB before. (Backups are 5 GB each and stored on S3.) This time I got an explicit error: “No space left on device”. Yet df shows heaps of space.

> [2019-11-29 22:42:58] Making sure /var/www/discourse/tmp/restores/default/2019-11-29-224258 exists...
> [2019-11-29 22:42:58] Downloading archive to tmp directory...
> [2019-11-29 22:45:46] Unzipping archive, this may take a while...
> [2019-11-29 22:51:46] No metadata file to extract.
> [2019-11-29 22:51:46] Validating metadata...
> [2019-11-29 22:51:46]   Current version: 20191129144706
> [2019-11-29 22:51:46]   Restored version: 20191108000414
> [2019-11-29 22:51:46] Extracting dump file...
> [2019-11-29 22:53:47] EXCEPTION: No space left on device @ io_write - /shared/tmp/restores/default/2019-11-29-224258/dump.sql

> [ec2-user@ip-172-31-47-237 discourse]$ df / -h
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/xvda1       99G   28G   71G  28% /

Interestingly, there is nothing in the referenced directory:

> /var/www/discourse# ls /shared/tmp/*
> /shared/tmp/backups:
> < empty >
> /shared/tmp/restores:
> < empty >

Could it be an issue with the swap file? The EC2 instance is a t2.small with 2 GB memory, so I had long ago created a swap file on the working forum. I assume the swap file would be replicated on the new instance. I’m no expert, but I think the swapfile exists, because it wouldn’t let me create a new one, and because:

> /var/www/discourse# swapon -s
> Filename                                Type            Size    Used    Priority
> /swapfile                               file            2097148 1024    -2

Could be an inode issue. What does df -i tell you?

df -i
Filesystem     Inodes  IUsed   IFree IUse% Mounted on
devtmpfs       252562    437  252125    1% /dev
tmpfs          255203      1  255202    1% /dev/shm
/dev/xvda1    6553600 737194 5816406   12% /

Getting out of my league, but I assume that’s not bad?

No, that’s ok.

‘No space left on device’ does not just occur when a device is out of gigabytes, but also when the filesystem is out of inodes. But that is clearly not the issue here (IUse% would be at 100% then).


Still no joy. I thought I would try restoring on a new Lightsail instance, rather than launching an AMI of my working EC2 instance. It still fails, but the messages are somewhat different.

Both old and new instances are up-to-date, both are standard Docker installs, and both are running the same postgres version:

psql --version
psql (PostgreSQL) 10.10 (Debian 10.10-1.pgdg100+1)

Is this normal:

Creating missing functions in the discourse_functions schema
Cannot restore into different schema, restoring in-place

Could it be related to plugins? I have several plugins, both supported and custom, installed on the “source” site. Some use custom user fields. I’ve tried restoring to clean “destination” sites with and without plugins.

Any pointers on how to start comparing schemas? (A rough starting point is sketched after the log below.)

> [2019-12-07 04:51:36] 'admin' has started the restore!
> [2019-12-07 04:51:36] Marking restore as running...
> [2019-12-07 04:51:36] Making sure /var/www/discourse/tmp/restores/default/2019-12-07-045136 exists...
> [2019-12-07 04:51:36] Downloading archive to tmp directory...
> [2019-12-07 04:53:49] Unzipping archive, this may take a while...
> [2019-12-07 04:57:12] No metadata file to extract.
> [2019-12-07 04:57:12] Validating metadata...
> [2019-12-07 04:57:12]   Current version: 20191129144706
> [2019-12-07 04:57:12]   Restored version: 20191120015344
> [2019-12-07 04:57:12] Extracting dump file...
> [2019-12-07 04:59:10] Creating missing functions in the discourse_functions schema
> [2019-12-07 04:59:11] Cannot restore into different schema, restoring in-place
> [2019-12-07 05:05:02] ERROR:  current transaction is aborted, commands ignored until end of transaction block
> [2019-12-07 05:05:03] ERROR:  current transaction is aborted, commands ignored until end of transaction block
> < repeats about 100 times >
> [2019-12-07 05:05:03] ERROR:  current transaction is aborted, commands ignored until end of transaction block
> [2019-12-07 05:05:03] EXCEPTION: psql failed
> [2019-12-07 05:05:03] /var/www/discourse/lib/backup_restore/restorer.rb:331:in `restore_dump'
> /var/www/discourse/lib/backup_restore/restorer.rb:75:in `run'
> /var/www/discourse/lib/backup_restore.rb:166:in `block in start!'
> /var/www/discourse/lib/backup_restore.rb:163:in `fork'
> /var/www/discourse/lib/backup_restore.rb:163:in `start!'
> /var/www/discourse/lib/backup_restore.rb:22:in `restore!'
> /var/www/discourse/app/controllers/admin/backups_controller.rb:119:in `restore'
> < rest of traceback >
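For the schema-comparison question above, one neutral starting point (a sketch using only the standard information_schema catalog, nothing Discourse-specific) is to list each schema with its table count on both servers and compare the output:

-- List every schema and how many tables it contains; run on both the source
-- and destination databases and diff the results.
SELECT table_schema, COUNT(*) AS tables
FROM information_schema.tables
GROUP BY table_schema
ORDER BY table_schema;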

Yes, that’s normal.

There is clearly something going wrong inside PostgreSQL. Have you taken a look at its logs?

Long shot: could this be memory related? Can you monitor the free -m output while restoring and see whether it runs out of (virtual) memory?


I know it’s a hard question since we don’t know what the problem is, but generally should I install the plugins on the destination site before trying to restore? Or does restore pull and build the plugins?

Yes, you should; the restore won’t do that for you.

OTOH I don’t think that this is your issue, since restore will take care of the right database structure (including plugin-specific stuff).

I created a new $20 Lightsail instance with 4 GB of memory. I watched free -m during the restore. It always had plenty of free and available memory.

The specific errors vary depending on whether I have the plugins installed, but they are probably due to the same root cause. In this case I did not install the plugins before restoring. There are two classes of errors:

  1. In the postgres logs, I get a lot of these, sometimes at character 34 and sometimes at character 41.

discourse@discourse ERROR: relation “user_auth_tokens” does not exist at character 34

Those do not have a corresponding output in the admin > backup > log page, and the restore continues for several minutes despite these errors.

  2. The second error depends on whether plugins are installed. In this case I did not install them, so I get an error about Data Explorer, and that’s when it all goes pear-shaped.

In admin > backup > logs:

[2019-12-07 07:38:34] CREATE INDEX
[2019-12-07 07:38:34] CREATE INDEX
[2019-12-07 07:38:34] ERROR: could not create unique index “index_plugin_store_rows_on_plugin_name_and_key”
[2019-12-07 07:38:34] DETAIL: Key (plugin_name, key)=(discourse-data-explorer, q:-6) is duplicated.
[2019-12-07 07:38:34] ERROR: current transaction is aborted, commands ignored until end of transaction block
[2019-12-07 07:38:34] ERROR: current transaction is aborted, commands ignored until end of transaction block
< repeats 1000X and quits >

And the corresponding output in the postgres log:

2019-12-07 07:38:34.718 UTC [8991] discourse@discourse LOG: duration: 165.427 ms statement: CREATE INDEX index_notifications_on_user_id_and_topic_id_and_post_number ON public.notifications USING btree (user_id, topic_id, post_number);
2019-12-07 07:38:34.767 UTC [8991] discourse@discourse ERROR: could not create unique index “index_plugin_store_rows_on_plugin_name_and_key”
2019-12-07 07:38:34.767 UTC [8991] discourse@discourse DETAIL: Key (plugin_name, key)=(discourse-data-explorer, q:-6) is duplicated.
2019-12-07 07:38:34.767 UTC [8991] discourse@discourse STATEMENT: CREATE UNIQUE INDEX index_plugin_store_rows_on_plugin_name_and_key ON public.plugin_store_rows USING btree (plugin_name, key);
2019-12-07 07:38:34.984 UTC [8991] discourse@discourse ERROR: current transaction is aborted, commands ignored until end of transaction block
2019-12-07 07:38:34.984 UTC [8991] discourse@discourse STATEMENT: CREATE INDEX index_policy_users_on_post_policy_id_and_user_id ON public.policy_users USING btree (post_policy_id, user_id);


I think #1 is not fatal and just a side effect of the in-place restore.

You might consider (exporting and) deleting all data explorer queries and removing the data explorer plugin prior to creating your backup.

Alternatively, can you post the relevant content of the plugin_store_rows table?
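If it helps, a query along these lines (a sketch using only the table and columns named in the error) should list just the duplicated pairs:

-- Show (plugin_name, key) pairs that occur more than once; these are what the
-- unique index index_plugin_store_rows_on_plugin_name_and_key trips over.
SELECT plugin_name, key, COUNT(*) AS copies, array_agg(id ORDER BY id) AS ids
FROM plugin_store_rows
WHERE plugin_name = 'discourse-data-explorer'
GROUP BY plugin_name, key
HAVING COUNT(*) > 1
ORDER BY key;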


There are indeed rows with duplicate (plugin_name, key) pairs, e.g. q:-11 and q:-2, but with unique ids. I don’t see any pattern among the duplicates; they’re not my favorite queries or anything.

So my next step will be to remove the duplicates, make a backup, and try to restore from that.


SELECT id, plugin_name, key FROM plugin_store_rows
WHERE plugin_name = 'discourse-data-explorer'
ORDER BY key;

id plugin_name key
1138 discourse-data-explorer q:-1
1136 discourse-data-explorer q:-10
813 discourse-data-explorer q:10
1142 discourse-data-explorer q:-11
1397 discourse-data-explorer q:-11
825 discourse-data-explorer q:11
889 discourse-data-explorer q:13
1004 discourse-data-explorer q:14
1005 discourse-data-explorer q:15
1043 discourse-data-explorer q:17
1044 discourse-data-explorer q:18
514 discourse-data-explorer q:-2
1249 discourse-data-explorer q:-2
764 discourse-data-explorer q:2
1053 discourse-data-explorer q:21
1066 discourse-data-explorer q:22
1082 discourse-data-explorer q:23
1097 discourse-data-explorer q:24
1131 discourse-data-explorer q:26
1132 discourse-data-explorer q:27
1134 discourse-data-explorer q:28
1135 discourse-data-explorer q:29
775 discourse-data-explorer q:3
1137 discourse-data-explorer q:30
1140 discourse-data-explorer q:31
1141 discourse-data-explorer q:32
1143 discourse-data-explorer q:33
1149 discourse-data-explorer q:34
1155 discourse-data-explorer q:35
1156 discourse-data-explorer q:36
1157 discourse-data-explorer q:37
1158 discourse-data-explorer q:38
1161 discourse-data-explorer q:39
513 discourse-data-explorer q:-4
777 discourse-data-explorer q:4
1211 discourse-data-explorer q:40
1215 discourse-data-explorer q:41
1223 discourse-data-explorer q:42
1224 discourse-data-explorer q:43
1225 discourse-data-explorer q:44
1226 discourse-data-explorer q:45
1269 discourse-data-explorer q:46
1272 discourse-data-explorer q:47
1273 discourse-data-explorer q:48
1274 discourse-data-explorer q:49
1279 discourse-data-explorer q:50
1281 discourse-data-explorer q:51
1282 discourse-data-explorer q:52
1301 discourse-data-explorer q:53
1349 discourse-data-explorer q:54
1369 discourse-data-explorer q:55
1373 discourse-data-explorer q:56
1384 discourse-data-explorer q:57
1387 discourse-data-explorer q:58
1396 discourse-data-explorer q:59
1222 discourse-data-explorer q:-6
1348 discourse-data-explorer q:-6
781 discourse-data-explorer q:6
763 discourse-data-explorer q:-7
782 discourse-data-explorer q:7
515 discourse-data-explorer q:-8
791 discourse-data-explorer q:8
1139 discourse-data-explorer q:-9
798 discourse-data-explorer q:9
507 discourse-data-explorer q:_id

Actually, how can I delete the duplicates? All three are owned by “system”, so I can’t edit or delete them.

Found the pattern. When I run a query owned by system, it creates a duplicate, which evidently breaks restore.


I can’t reproduce this on a clean test site, but it happens consistently on my production site. I’ve installed all the production plugins on the test site and still can’t repro.

  1. How can I track down what’s wrong with my production site?

  2. How can I remove the duplicate queries, since they are owned by system? Do I have to sudo -u postgres psql discourse…? Sounds scary.


To make the backup restorable, you can probably delete the duplicate lines from the backed-up SQL file (the dump.sql inside the archive).
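Alternatively, if you would rather clean up the live table before taking a fresh backup, something along these lines should do it (an untested sketch: take a backup first and run it against the discourse database as the postgres user). It keeps the oldest row of each duplicated pair and deletes the rest; whether the older or newer copy is the one Data Explorer actually uses is worth verifying first.

-- Untested sketch: for each duplicated (plugin_name, key) pair, delete every
-- row except the one with the lowest id.
DELETE FROM plugin_store_rows newer
USING plugin_store_rows older
WHERE newer.plugin_name = 'discourse-data-explorer'
  AND older.plugin_name = newer.plugin_name
  AND older.key = newer.key
  AND older.id < newer.id;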

It’s possible that the dev database is missing that index for some reason?
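To check whether that unique index actually exists on a given database, pg_indexes is the standard catalog view (again just a sketch):

-- List the indexes defined on plugin_store_rows; the unique index
-- index_plugin_store_rows_on_plugin_name_and_key should be among them.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'plugin_store_rows';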
