I have a working forum and want to see some things that got messed up a few days ago. I’m on AWS, so I make an AMI of the working forum, launch a new instance, and try to restore a backup from a few days ago. It fails with the messages below.
It can’t be a version or schema mismatch, because the server is built from a fresh image of the working forum.
I’ve tried rebuilding.
I’ve tried restoring from a different backup from just one day ago: same thing.
The only odd thing I’ve done is delete PDF files from the uploads directory (…/uploads/original/1X/*.pdf) to free up some space. I’ll try again without that step, but it seems unlikely to be the culprit.
> [2019-11-30 01:17:44] 'admin' has started the restore!
> [2019-11-30 01:17:44] Marking restore as running...
> [2019-11-30 01:17:44] Making sure /var/www/discourse/tmp/restores/default/2019-11-30-011744 exists...
> [2019-11-30 01:17:44] Downloading archive to tmp directory...
> [2019-11-30 01:23:24] Unzipping archive, this may take a while...
> [2019-11-30 01:27:52] No metadata file to extract.
> [2019-11-30 01:27:52] Validating metadata...
> [2019-11-30 01:27:52] Current version: 20191129144706
> [2019-11-30 01:27:52] Restored version: 20191120015344
> [2019-11-30 01:27:52] Extracting dump file...
> [2019-11-30 01:50:57] invalid command \N
> [2019-11-30 01:50:57] invalid command \N
>
> < repeats about 100 times >
>
> [2019-11-30 01:51:07] invalid command \N
> [2019-11-30 01:54:13] invalid command \N
> [2019-11-30 01:54:13] EXCEPTION: psql failed
> [2019-11-30 01:54:14] /var/www/discourse/lib/backup_restore/restorer.rb:331:in `restore_dump'
> /var/www/discourse/lib/backup_restore/restorer.rb:75:in `run'
> /var/www/discourse/lib/backup_restore.rb:166:in `block in start!'
> /var/www/discourse/lib/backup_restore.rb:163:in `fork'
> /var/www/discourse/lib/backup_restore.rb:163:in `start!'
> /var/www/discourse/lib/backup_restore.rb:22:in `restore!'
> /var/www/discourse/app/controllers/admin/backups_controller.rb:119:in `restore'
> etc...
I tend to associate those errors with a Postgres version mismatch, but I saw those \N errors the other day on a system that had run out of disk space (I was restoring to the same system that made the backup). I never finished diagnosing it (it was part of another bizarre problem I was having, and restoring the backup to another server solved that; I wondered whether restoring to the same server would have worked too).
You mentioned that you were low on space. I suspect that’s the issue. Restoring uses a lot of space: it uncompresses the backup, so at one point there are two full copies of it, plus the space needed for the restored database and for reverting if the restore fails.
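As a rough rule of thumb, that works out to roughly two copies of the archive plus room for the restored database. A sketch (the 5 GB figure is from your post; the database size is my assumption, so adjust both):

```shell
# Rough pre-restore space check -- a sketch, not Discourse's actual accounting.
# Assumptions: 5 GB backup archive, and a restored database of roughly
# similar size once loaded; substitute your own numbers.
backup_gb=5
db_gb=5
needed_gb=$(( 2 * backup_gb + db_gb ))   # archive + uncompressed copy + restored DB
avail_gb=$(df -BG --output=avail / | tail -n 1 | tr -dc '0-9')
echo "need roughly ${needed_gb} GB free, have ${avail_gb} GB available on /"
```

If available space is anywhere near the estimate, that alone can explain a failed restore.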
It gets worse, but maybe closer to the true problem… On the hypothesis that I needed more disk space, I created a new instance from my image, this time with 100 GB instead of the 50 GB before. (Backups are 5 GB each and stored on S3.) This time I got an explicit error: “No space left on device”. Yet df shows heaps of space.
> [2019-11-29 22:42:58] Making sure /var/www/discourse/tmp/restores/default/2019-11-29-224258 exists...
> [2019-11-29 22:42:58] Downloading archive to tmp directory...
> [2019-11-29 22:45:46] Unzipping archive, this may take a while...
> [2019-11-29 22:51:46] No metadata file to extract.
> [2019-11-29 22:51:46] Validating metadata...
> [2019-11-29 22:51:46] Current version: 20191129144706
> [2019-11-29 22:51:46] Restored version: 20191108000414
> [2019-11-29 22:51:46] Extracting dump file...
> [2019-11-29 22:53:47] EXCEPTION: No space left on device @ io_write - /shared/tmp/restores/default/2019-11-29-224258/dump.sql
> [ec2-user@ip-172-31-47-237 discourse]$ df / -h
> Filesystem Size Used Avail Use% Mounted on
> /dev/xvda1 99G 28G 71G 28% /
Interestingly, there is nothing in the referenced directory.
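For what it’s worth, here’s how I’ve been comparing the two views of disk space. This assumes the standard Docker layout, where the container’s /shared is a bind mount of /var/discourse/shared on the host (that path is an assumption about my install):

```shell
# The host's df and the app container's df can disagree: the error above is
# about /shared *inside* the container, while my df was run on the host.
host_side=/var/discourse/shared
echo "host view of the shared volume:"
df -h "$host_side" 2>/dev/null || echo "(no $host_side on this machine)"
echo "container view: run 'cd /var/discourse && ./launcher enter app', then 'df -h /shared'"
```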
Could it be an issue with the swap file? The EC2 instance is a t2.small with 2 GB memory, so I had long ago created a swap file on the working forum. I assume the swap file would be replicated on the new instance. I’m no expert, but I think the swapfile exists, because it wouldn’t let me create a new one, and because:
> /var/www/discourse# swapon -s
> Filename Type Size Used Priority
> /swapfile file 2097148 1024 -2
‘No space left on device’ does not only occur when a device is out of gigabytes; it also occurs when the filesystem is out of inodes. But that is clearly not the issue here (IUse% would be at 100% then).
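For completeness, both checks side by side:

```shell
# ENOSPC ("No space left on device") can mean no free bytes OR no free inodes.
df -h /    # bytes; watch the Use% column
df -i /    # inodes; IUse% at 100% also produces ENOSPC
```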
Still no joy. I thought I would try restoring on a new Lightsail instance, rather than launching an AMI of my working EC2 instance. It still fails, but the messages are somewhat different.
Both old and new instances are up to date, both are standard Docker installs, and both are running the same Postgres version. The new messages include:
Creating missing functions in the discourse_functions schema
Cannot restore into different schema, restoring in-place
Could it be related to plugins? I have several plugins, both supported and custom, installed on the “source” site. Some use custom user fields. I’ve tried restoring to clean “destination” sites with and without plugins.
Any pointers how to start comparing schemas?
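In case it helps, here’s where I was thinking of starting: dump just the schema on each server and diff the results. A sketch only; the file names are mine, and I’m assuming pg_dump is reachable inside each server’s app container:

```shell
# Dump schema only -- no data, no ownership noise -- on each server.
dump_schema() {
  pg_dump --schema-only --no-owner discourse > "/tmp/schema-$(hostname).sql"
}
if command -v pg_dump >/dev/null 2>&1; then
  dump_schema || echo "pg_dump failed here -- run this inside the app container"
else
  echo "pg_dump not on PATH -- run this inside the app container"
fi
# After running it on both servers, copy the two files to one machine and:
#   diff -u schema-old.sql schema-new.sql | less
```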
> [2019-12-07 04:51:36] 'admin' has started the restore!
> [2019-12-07 04:51:36] Marking restore as running...
> [2019-12-07 04:51:36] Making sure /var/www/discourse/tmp/restores/default/2019-12-07-045136 exists...
> [2019-12-07 04:51:36] Downloading archive to tmp directory...
> [2019-12-07 04:53:49] Unzipping archive, this may take a while...
> [2019-12-07 04:57:12] No metadata file to extract.
> [2019-12-07 04:57:12] Validating metadata...
> [2019-12-07 04:57:12] Current version: 20191129144706
> [2019-12-07 04:57:12] Restored version: 20191120015344
> [2019-12-07 04:57:12] Extracting dump file...
> [2019-12-07 04:59:10] Creating missing functions in the discourse_functions schema
> [2019-12-07 04:59:11] Cannot restore into different schema, restoring in-place
> [2019-12-07 05:05:02] ERROR: current transaction is aborted, commands ignored until end of transaction block
> [2019-12-07 05:05:03] ERROR: current transaction is aborted, commands ignored until end of transaction block
> < repeats about 100 times >
> [2019-12-07 05:05:03] ERROR: current transaction is aborted, commands ignored until end of transaction block
> [2019-12-07 05:05:03] EXCEPTION: psql failed
> [2019-12-07 05:05:03] /var/www/discourse/lib/backup_restore/restorer.rb:331:in `restore_dump'
> /var/www/discourse/lib/backup_restore/restorer.rb:75:in `run'
> /var/www/discourse/lib/backup_restore.rb:166:in `block in start!'
> /var/www/discourse/lib/backup_restore.rb:163:in `fork'
> /var/www/discourse/lib/backup_restore.rb:163:in `start!'
> /var/www/discourse/lib/backup_restore.rb:22:in `restore!'
> /var/www/discourse/app/controllers/admin/backups_controller.rb:119:in `restore'
> < rest of traceback >
I know it’s a hard question since we don’t know what the problem is, but generally should I install the plugins on the destination site before trying to restore? Or does restore pull and build the plugins?
I created a new $20 Lightsail with 4GB memory. I watched ‘free -m’ during the restore. It always had plenty of free and available memory.
The specific errors vary depending on whether I have the plugins installed, but are probably due to the same root cause. In this case I did not install the plugins before the backup. There are two classes of errors:
In the postgres logs, I get a lot of these, sometimes at character 34 and sometimes at character 41.
discourse@discourse ERROR: relation “user_auth_tokens” does not exist at character 34
Those do not have a corresponding output in the admin > backup > log page, and the restore continues for several minutes despite these errors.
The second error depends on whether plugins are installed. In this case I did not install them, so I get an error about Data Explorer, and that’s when it all goes pear-shaped.
In admin > backup > logs:
[2019-12-07 07:38:34] CREATE INDEX
[2019-12-07 07:38:34] CREATE INDEX
[2019-12-07 07:38:34] ERROR: could not create unique index “index_plugin_store_rows_on_plugin_name_and_key”
[2019-12-07 07:38:34] DETAIL: Key (plugin_name, key)=(discourse-data-explorer, q:-6) is duplicated.
[2019-12-07 07:38:34] ERROR: current transaction is aborted, commands ignored until end of transaction block
[2019-12-07 07:38:34] ERROR: current transaction is aborted, commands ignored until end of transaction block
< repeats 1000X and quits >
And the corresponding output in the postgres log:
2019-12-07 07:38:34.718 UTC [8991] discourse@discourse LOG: duration: 165.427 ms statement: CREATE INDEX index_notifications_on_user_id_and_topic_id_and_post_number ON public.notifications USING btree (user_id, topic_id, post_number);
2019-12-07 07:38:34.767 UTC [8991] discourse@discourse ERROR: could not create unique index “index_plugin_store_rows_on_plugin_name_and_key”
2019-12-07 07:38:34.767 UTC [8991] discourse@discourse DETAIL: Key (plugin_name, key)=(discourse-data-explorer, q:-6) is duplicated.
2019-12-07 07:38:34.767 UTC [8991] discourse@discourse STATEMENT: CREATE UNIQUE INDEX index_plugin_store_rows_on_plugin_name_and_key ON public.plugin_store_rows USING btree (plugin_name, key);
2019-12-07 07:38:34.984 UTC [8991] discourse@discourse ERROR: current transaction is aborted, commands ignored until end of transaction block
2019-12-07 07:38:34.984 UTC [8991] discourse@discourse STATEMENT: CREATE INDEX index_policy_users_on_post_policy_id_and_user_id ON public.policy_users USING btree (post_policy_id, user_id);
There are indeed duplicate queries with duplicate (plugin_name, key) pairs, e.g. q:-11 and q:-2, but unique ids. I don’t see any pattern among the duplicates, like they’re not my favorite queries or anything.
So my next step will be to remove the duplicates, make a backup, and try to restore from that.
SELECT id, plugin_name, key FROM plugin_store_rows
WHERE plugin_name = 'discourse-data-explorer'
ORDER BY key;
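Narrowing that down to only the duplicated pairs (same table and columns as above; a sketch I haven’t run against production yet):

```sql
SELECT plugin_name, key, COUNT(*) AS copies, MIN(id) AS keep_id
FROM plugin_store_rows
WHERE plugin_name = 'discourse-data-explorer'
GROUP BY plugin_name, key
HAVING COUNT(*) > 1
ORDER BY key;
```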
I can’t reproduce this on a clean test site, but it happens consistently on my production site. I’ve installed all the production plugins on the test site, still can’t repro.
How can I track down what’s wrong with my production site?
How can I remove the duplicate queries, since they are owned by system? Do I have to sudo -u postgres psql discourse…? Sounds scary.
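For the record, here’s the delete I have in mind. A sketch only, untested, and I’ll take a fresh backup first. It keeps the lowest id of each duplicated (plugin_name, key) pair and removes the rest:

```sql
-- Preview the rows that would go away:
SELECT a.id, a.plugin_name, a.key
FROM plugin_store_rows a
JOIN plugin_store_rows b
  ON b.plugin_name = a.plugin_name AND b.key = a.key AND b.id < a.id
WHERE a.plugin_name = 'discourse-data-explorer';

-- Then delete everything but the lowest id of each pair:
DELETE FROM plugin_store_rows a
USING plugin_store_rows b
WHERE a.plugin_name = 'discourse-data-explorer'
  AND b.plugin_name = a.plugin_name
  AND b.key = a.key
  AND b.id < a.id;
```

I believe the access path is roughly what I guessed: enter the app container (`cd /var/discourse && ./launcher enter app`), then `sudo -u postgres psql discourse` — but I’d welcome confirmation before I run anything destructive.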