Large restore failing midway through

Hello again! I shelved the last migration I posted about, but picked it back up. Got a little more urgent with some issues on the (outdated) host the original installation lived on.

Context: During the migration to a new instance I decided to go ahead and migrate to an external PG and Redis instance. Setting up the fresh install went fine, and I’m planning on doing a backup + restore from the command line for the migration itself. Currently trying to just test the workflow and make sure the newly restored instance works as expected before I set the original instance read-only and proceed, particularly since this is a fairly old/large instance I’m migrating.

Old instance: simple self-hosting setup, one instance with the standard colocated PG, Postgres, Redis.
New instance: Same app.yml, but with external, managed PG + Redis (DigitalOcean)

The restore command seems to go fine for quite a while. Then, after a considerable amount of time, I consistently get an error that looks like this (including some success logs for context):

COPY 99820
COPY 3216770
COPY 3251307
SSL connection has been closed unexpectedly
FATAL:  terminating connection due to administrator command
CONTEXT:  COPY post_timings, line 63404000: "8311	4897	1816	6999"
SSL connection has been closed unexpectedly
FATAL:  terminating connection due to administrator command
CONTEXT:  COPY post_timings, line 63404000: "8311	4897	1816	6999"
invalid socket
connection to server was lost
EXCEPTION: psql failed: connection to server was lost
/var/www/discourse/lib/backup_restore/database_restorer.rb:95:in `restore_dump'

It’s happened at different specific copy points across runs, so as far as I can tell this isn’t anything specific to the migration itself. Given that clearly the DB connections all work, I think it’s safe to assume this is something related to DigitalOcean’s behavior, but I’m hoping someone else in here has seen something like this before and can point me in the right direction.

Since it’s a managed PG instance, you’ll want to check logs for that service.

(always look at the logs!)

You might find it tells you why this happened, for instance maybe it has a maximum connection lifetime configured and the restore is taking longer to complete.

2 Likes

Well…I uh…did check the logs and it was mostly the same information I was seeing in Discourse logs, but digging back through them caused me to actually check historical graphs around that time…I’d undersized the test DB :facepalm: so it filled up and DO just kills the connection. Doh.

Resized everything and we’re feeling dumb but back off to the races.

2 Likes