Changes introduced in postgresql 9.5.12 and 10.3 cause backup-related failures

Today I’ve been investigating issues related to the new “Avoid use of insecure search_path settings in pg_dump and other client programs” behaviour introduced into postgresql 9.5.12 and postgresql 10.3.

There’s a couple items that impact Discourse users:

  • backups taken using pg_dump of 9.5.12+ or 10.3+ are unable to be restored

    • this is due to the public schema now being explicitly referenced inside the restore script
      Discourse currently does a restore into the restore schema then switches that to the public schema - this now breaks
  • pg_dump pointed at pgbouncer can cause a full or partial site outage

    • an unintended side effect occurs that results in a null schema search_path being applied to anyone sharing the actual connection
    • this will cause the site to go down as the app servers can no longer implicitly see the public schema (i.e. all the data) until the connection is recycled
13 Likes

Thanks, @supermathie. Glad you’re on this.

As a stopgap, is there some way to push Docker images that have 9.5.11 in them so that we can go back to our dangerous, but happy lives? For me the discourse_dev image is what’s generating the backup that I really, really, wanted to show to people.

Here’s what I know. Thanks to @RGJ, I had this work-around worked out:

./launcher enter app
sv stop unicorn
sv stop redis
su postgres -c 'psql'
drop database discourse;
create database discourse;
\q
cd public/backups/default/
BACKUP=`ls |head`
zcat $BACKUP | su discourse  -c 'psql discourse'
sv start unicorn
sv start redis
rake db:migrate

This used to restore my no-files backup, but now it’s failing with

--------
      1
(1 row)

ERROR:  duplicate key value violates unique constraint "ar_internal_metadata_pkey"
DETAIL:  Key (key)=(environment) already exists.
CONTEXT:  COPY ar_internal_metadata, line 1
ERROR:  duplicate key value violates unique constraint "badge_groupings_pkey"
DETAIL:  Key (id)=(1) already exists.
CONTEXT:  COPY badge_groupings, line 1
 setval 
--------

And a bunch of other errors that are basically duplicate key kinds of things. It’s as if the database didn’t actually get dropped or. . . something?

We’re looking at this, yes.

In the meanwhile as a workaround in a container you can backrev your postgresql-client-9.5 packages to below 9.5.12 (or below 10.3) and the backup will work.

More or less: under the hood we restore to the restore schema and then atomically switch that to the public (production) schema so that the site is “up” during the restore. But these new versions of pg_dump generate a script that explicitly refers to the public schema… :boom: restore fails.

2 Likes

This is fixed in

https://github.com/discourse/discourse/commit/a89f3160a53081adbd2c05c4b3b146414500248a

12 Likes

We also need to handle a multisite configuration - a multisite won’t see the new backup_port unless it’s explicitly in the config:

base site:

[2] pry(main)> ActiveRecord::Base.connection_pool.spec.config
=> {:adapter=>"postgresql_fallback",
 :pool=>25,
 :connect_timeout=>5,
 :timeout=>5000,
 :host=>"2001:db8::1",
 :port=>6432,
 :backup_port=>5432,
 :username=>"pets",
 :password=>"pets",
 :replica_host=>"2001:db8::1001",
 :replica_port=>6432,
 :host_names=>["pets.discourse.org"],
 :database=>"pets",
 :prepared_statements=>false}

multisite:

[9] pry(main)> ActiveRecord::Base.connection_pool.spec.config
=> {:host=>"2001:db8::2",
 :replica_host=>"2001:db8::1002",
 :database=>"kittens",
 :username=>"kittens",
 :password=>"fluffy",
 :host_names=>["community.fluffykittens.com"],
 :pool=>8,
 :timeout=>5000,
 :adapter=>"postgresql_fallback",
 :prepared_statements=>false,
 :port=>6432,
 :replica_port=>6432,
 :connect_timeout=>5,
 :db_key=>"kittens"}
10 Likes

Now that we updated the base image, is there anything left to do here?

2 Likes