Large restore failing midway through

mmcc · January 6, 2025, 9:39pm

Hello again! I shelved the last migration I posted about, but picked it back up. Got a little more urgent with some issues on the (outdated) host the original installation lived on.

Context: During the migration to a new instance I decided to go ahead and migrate to an external PG and Redis instance. Setting up the fresh install went fine, and I’m planning on doing a backup + restore from the command line for the migration itself. Currently trying to just test the workflow and make sure the newly restored instance works as expected before I set the original instance read-only and proceed, particularly since this is a fairly old/large instance I’m migrating.

Old instance: simple self-hosting setup, one instance with the standard colocated PG, Postgres, Redis.
New instance: Same app.yml, but with external, managed PG + Redis (DigitalOcean)

The restore command seems to go fine for quite a while. Then, after a considerable amount of time, I consistently get an error that looks like this (including some success logs for context):

COPY 99820
COPY 3216770
COPY 3251307
SSL connection has been closed unexpectedly
FATAL:  terminating connection due to administrator command
CONTEXT:  COPY post_timings, line 63404000: "8311	4897	1816	6999"
SSL connection has been closed unexpectedly
FATAL:  terminating connection due to administrator command
CONTEXT:  COPY post_timings, line 63404000: "8311	4897	1816	6999"
invalid socket
connection to server was lost
EXCEPTION: psql failed: connection to server was lost
/var/www/discourse/lib/backup_restore/database_restorer.rb:95:in `restore_dump'

It’s happened at different specific copy points across runs, so as far as I can tell this isn’t anything specific to the migration itself. Given that clearly the DB connections all work, I think it’s safe to assume this is something related to DigitalOcean’s behavior, but I’m hoping someone else in here has seen something like this before and can point me in the right direction.

supermathie · January 6, 2025, 9:48pm

Since it’s a managed PG instance, you’ll want to check logs for that service.

(always look at the logs!)

You might find it tells you why this happened, for instance maybe it has a maximum connection lifetime configured and the restore is taking longer to complete.

mmcc · January 7, 2025, 2:33am

Well…I uh…did check the logs and it was mostly the same information I was seeing in Discourse logs, but digging back through them caused me to actually check historical graphs around that time…I’d undersized the test DB so it filled up and DO just kills the connection. Doh.

Resized everything and we’re feeling dumb but back off to the races.

system · February 6, 2025, 2:34am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Timeout when restoring a backup Installation	2	868	December 16, 2020
Error when migrating to an external postgres instance Installation	12	378	January 18, 2024
Can't restore saved backup Installation	3	1199	March 7, 2017
Restore fails, hangs, timeouts on ALTER TABLE Support	3	413	April 3, 2021
Restore error -- database escape problem? Support	7	1295	June 8, 2024

Large restore failing midway through

Related topics