Can't bootstrap: "PANIC: could not locate a valid checkpoint record"


(Zach Alexander) #1

Oh, boy.

I just made what seemed like a minor edit to app.yml (new SMTP credentials), and when I tried to rebuild, I got this:

2014-06-11 22:59:15 UTC LOG:  database system was interrupted; last known up at 2014-06-11 22:55:34 UTC
2014-06-11 22:59:15 UTC LOG:  invalid primary checkpoint record
2014-06-11 22:59:15 UTC LOG:  invalid secondary checkpoint record
2014-06-11 22:59:15 UTC PANIC:  could not locate a valid checkpoint record
2014-06-11 22:59:15 UTC LOG:  startup process (PID 76) was terminated by signal 6: Aborted
2014-06-11 22:59:15 UTC LOG:  aborting startup due to startup process failure

The internets tell me to try pg_resetxlog, but the command is unavailable. (Will try to install and run it unless people advise me otherwise.)

And no, I don’t have a recent backup, because per this thread, my forum is apparently creating corrupted backups. (@zogstrip is looking into this I think.) :<

Any ideas? Thanks in advance.

Full log:

WARNING: No swap limit support
Stopping old container
5bc57b3806abcf9e81d4e3624573e07f9c39c36d68f3d9c6ae8d73507a83d67c
Calculated ENV: -e HOME=/root -e RAILS_ENV=production -e UNICORN_WORKERS=3 -e RUBY_GC_MALLOC_LIMIT=40000000 -e RUBY_HEAP_MIN_SLOTS=800000 -e DISCOURSE_DB_SOCKET=/var/run/postgresql -e DISCOURSE_DB_HOST= -e DISCOURSE_DB_PORT= -e DISCOURSE_DEVELOPER_EMAILS=<emails> -e DISCOURSE_HOSTNAME=<domain> -e DISCOURSE_SMTP_ADDRESS=smtp.mailgun.org -e DISCOURSE_SMTP_PORT=587 -e DISCOURSE_SMTP_USER_NAME=postmaster@<domain> -e DISCOURSE_SMTP_PASSWORD=2n74zq4396t9
cd /pups && git pull && /pups/bin/pups --stdin
Already up-to-date.
I, [2014-06-11T22:59:09.514573 #32]  INFO -- : Loading --stdin
I, [2014-06-11T22:59:09.528032 #32]  INFO -- : File > /etc/service/cron/run  chmod: +x
I, [2014-06-11T22:59:09.533623 #32]  INFO -- : File > /etc/service/rsyslog/run  chmod: +x
I, [2014-06-11T22:59:09.534198 #32]  INFO -- : > echo cron installed
I, [2014-06-11T22:59:09.536646 #32]  INFO -- : cron installed

I, [2014-06-11T22:59:09.542532 #32]  INFO -- : File > /var/lib/postgresql/take-database-backup  chmod: +x
I, [2014-06-11T22:59:09.546289 #32]  INFO -- : File > /var/spool/cron/crontabs/postgres  chmod:
I, [2014-06-11T22:59:09.547043 #32]  INFO -- : > apt-get -y install rsyslog
I, [2014-06-11T22:59:12.307508 #32]  INFO -- : Reading package lists...
Building dependency tree...
Reading state information...
rsyslog is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.

I, [2014-06-11T22:59:12.308486 #32]  INFO -- : > apt-get install -y socat
I, [2014-06-11T22:59:14.915321 #32]  INFO -- : Reading package lists...
Building dependency tree...
Reading state information...
socat is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.

I, [2014-06-11T22:59:14.916483 #32]  INFO -- : > mkdir -p /shared/postgres_run
I, [2014-06-11T22:59:14.920092 #32]  INFO -- :
I, [2014-06-11T22:59:14.920813 #32]  INFO -- : > chown postgres:postgres /shared/postgres_run
I, [2014-06-11T22:59:14.924709 #32]  INFO -- :
I, [2014-06-11T22:59:14.925474 #32]  INFO -- : > chmod 775 /shared/postgres_run
I, [2014-06-11T22:59:14.927919 #32]  INFO -- :
I, [2014-06-11T22:59:14.928595 #32]  INFO -- : > rm -fr /var/run/postgresql
I, [2014-06-11T22:59:14.931496 #32]  INFO -- :
I, [2014-06-11T22:59:14.932232 #32]  INFO -- : > ln -s /shared/postgres_run /var/run/postgresql
I, [2014-06-11T22:59:14.934729 #32]  INFO -- :
I, [2014-06-11T22:59:14.935359 #32]  INFO -- : > socat /dev/null UNIX-CONNECT:/shared/postgres_run/.s.PGSQL.5432 || exit 0 && echo postgres already running stop container ; exit 1
2014/06/11 22:59:14 socat[56] E connect(4, AF=1 "/shared/postgres_run/.s.PGSQL.5432", 36): Connection refused
I, [2014-06-11T22:59:14.941489 #32]  INFO -- :
I, [2014-06-11T22:59:14.942084 #32]  INFO -- : > rm -fr /shared/postgres_run/.s*
I, [2014-06-11T22:59:14.945366 #32]  INFO -- :
I, [2014-06-11T22:59:14.945871 #32]  INFO -- : > rm -fr /shared/postgres_run/*.pid
I, [2014-06-11T22:59:14.948982 #32]  INFO -- :
I, [2014-06-11T22:59:14.954387 #32]  INFO -- : File > /etc/service/postgres/run  chmod: +x
I, [2014-06-11T22:59:14.959534 #32]  INFO -- : File > /root/upgrade_postgres  chmod: +x
I, [2014-06-11T22:59:14.960804 #32]  INFO -- : > chown -R root /var/lib/postgresql/9.3/main
I, [2014-06-11T22:59:15.147881 #32]  INFO -- :
I, [2014-06-11T22:59:15.148344 #32]  INFO -- : > [ ! -e /shared/postgres_data ] && install -d -m 0755 -o postgres -g postgres /shared/postgres_data && sudo -u postgres /usr/lib/postgresql/9.3/bin/initdb -D /shared/postgres_data || exit 0
I, [2014-06-11T22:59:15.151611 #32]  INFO -- :
I, [2014-06-11T22:59:15.151752 #32]  INFO -- : > chown -R postgres:postgres /shared/postgres_data
I, [2014-06-11T22:59:15.169879 #32]  INFO -- :
I, [2014-06-11T22:59:15.170183 #32]  INFO -- : > chown -R postgres:postgres /var/run/postgresql
I, [2014-06-11T22:59:15.173469 #32]  INFO -- :
I, [2014-06-11T22:59:15.174011 #32]  INFO -- : > /root/upgrade_postgres
I, [2014-06-11T22:59:15.179036 #32]  INFO -- :
I, [2014-06-11T22:59:15.179566 #32]  INFO -- : Replacing data_directory = '/var/lib/postgresql/9.3/main' with data_directory = '/shared/postgres_data' in /etc/postgresql/9.3/main/postgresql.conf
I, [2014-06-11T22:59:15.180490 #32]  INFO -- : Replacing (?-mix:#?listen_addresses *=.*) with listen_addresses = '*' in /etc/postgresql/9.3/main/postgresql.conf
I, [2014-06-11T22:59:15.182608 #32]  INFO -- : > install -d -m 0755 -o postgres -g postgres /shared/postgres_backup
I, [2014-06-11T22:59:15.186967 #32]  INFO -- :
I, [2014-06-11T22:59:15.187385 #32]  INFO -- : Replacing (?-mix:#?max_wal_senders *=.*) with max_wal_senders = 4 in /etc/postgresql/9.3/main/postgresql.conf
I, [2014-06-11T22:59:15.188117 #32]  INFO -- : Replacing (?-mix:#?wal_level *=.*) with wal_level = hot_standby in /etc/postgresql/9.3/main/postgresql.conf
I, [2014-06-11T22:59:15.188962 #32]  INFO -- : Replacing (?-mix:^#local +replication +postgres +peer$) with local replication postgres  peer in /etc/postgresql/9.3/main/pg_hba.conf
I, [2014-06-11T22:59:15.189615 #32]  INFO -- : Replacing (?-mix:^host.*all.*all.*127.*$) with host all all 0.0.0.0/0 md5 in /etc/postgresql/9.3/main/pg_hba.conf
I, [2014-06-11T22:59:15.190428 #32]  INFO -- : > sudo -u postgres /usr/lib/postgresql/9.3/bin/postmaster -D /etc/postgresql/9.3/main
I, [2014-06-11T22:59:15.192798 #32]  INFO -- : > sleep 5
2014-06-11 22:59:15 UTC LOG:  database system was interrupted; last known up at 2014-06-11 22:55:34 UTC
2014-06-11 22:59:15 UTC LOG:  invalid primary checkpoint record
2014-06-11 22:59:15 UTC LOG:  invalid secondary checkpoint record
2014-06-11 22:59:15 UTC PANIC:  could not locate a valid checkpoint record
2014-06-11 22:59:15 UTC LOG:  startup process (PID 76) was terminated by signal 6: Aborted
2014-06-11 22:59:15 UTC LOG:  aborting startup due to startup process failure
I, [2014-06-11T22:59:20.195500 #32]  INFO -- :
I, [2014-06-11T22:59:20.196263 #32]  INFO -- : > sudo -u postgres createdb discourse || exit 0
createdb: could not connect to database template1: could not connect to server: No such file or directory
	Is the server running locally and accepting
	connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
I, [2014-06-11T22:59:20.268469 #32]  INFO -- :
I, [2014-06-11T22:59:20.269604 #32]  INFO -- : > sudo -u postgres psql discourse
I, [2014-06-11T22:59:20.272313 #32]  INFO -- : create user discourse;

psql: could not connect to server: No such file or directory
	Is the server running locally and accepting
	connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
I, [2014-06-11T22:59:20.343965 #32]  INFO -- : > sudo -u postgres psql discourse
I, [2014-06-11T22:59:20.346607 #32]  INFO -- : grant all privileges on database discourse to discourse;

psql: could not connect to server: No such file or directory
	Is the server running locally and accepting
	connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
I, [2014-06-11T22:59:20.413209 #32]  INFO -- : > /bin/bash -c 'sudo -u postgres psql discourse <<< "alter schema public owner to discourse;"'
psql: could not connect to server: No such file or directory
	Is the server running locally and accepting
	connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
I, [2014-06-11T22:59:20.484599 #32]  INFO -- :
/pups/lib/pups/exec_command.rb:85:in `spawn': /bin/bash -c 'sudo -u postgres psql discourse <<< "alter schema public owner to discourse;"' failed with return #<Process::Status: pid 96 exit 2> (RuntimeError)
	from /pups/lib/pups/exec_command.rb:55:in `block in run'
	from /pups/lib/pups/exec_command.rb:53:in `each'
	from /pups/lib/pups/exec_command.rb:53:in `run'
	from /pups/lib/pups/command.rb:5:in `run'
	from /pups/lib/pups/config.rb:85:in `block (2 levels) in run_commands'
	from /pups/lib/pups/config.rb:76:in `each'
	from /pups/lib/pups/config.rb:76:in `block in run_commands'
	from /pups/lib/pups/config.rb:75:in `each'
	from /pups/lib/pups/config.rb:75:in `run_commands'
	from /pups/lib/pups/config.rb:71:in `run'
	from /pups/lib/pups/cli.rb:31:in `run'
	from /pups/bin/pups:8:in `<main>'
094bd31f4dc4f70442d37932683d6c1b1bc3188bbb7ea3af8879f4a88bc968d2
FAILED TO BOOTSTRAP

(Jeff Atwood) #2

Is something going on with the disk subsystem on your server? Haven’t heard much about database corruptions…


(Zach Alexander) #3

Not that I’m aware of. I’m on Digital Ocean and haven’t noticed any other issues.


(Zach Alexander) #4

Update: got pg_resetxlog to work, which fixed the proximate issue.

I think (actually, a coworker thinks) this may have happened because duplicate containers somehow ended up running at the same time, both connected to the same database.


(Jeff Atwood) #5

Oh yes that could definitely cause it. Anything weird happen with deployment or updates where multiple containers were somehow running?
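
A rough way to check for the duplicate-container situation before a rebuild might look like the sketch below. It assumes a Docker CLI that supports `docker ps --filter` and `--format`, and it matches containers by name, which is only a heuristic: two containers with unrelated names can still share the same data directory. The function names and the container name `app` are placeholders, not something taken from this thread.

```shell
# List the names of running containers whose name contains the given text.
# Assumption: the Docker CLI is installed and supports --filter and --format.
running_containers() {
  docker ps --filter "name=$1" --format '{{.Names}}'
}

# Return non-zero (and warn on stderr) if more than one running container matches.
check_single_container() {
  # grep -c . counts non-empty lines; "|| true" keeps a zero count from
  # looking like a failure when the shell runs with "set -e".
  count=$(running_containers "$1" | grep -c . || true)
  if [ "$count" -gt 1 ]; then
    echo "WARNING: $count running containers match '$1'; stop the extras before rebuilding" >&2
    return 1
  fi
  return 0
}

# Usage (hypothetical container name):
#   check_single_container app && echo "safe to rebuild"
```

If the check fails, stopping the extra containers before starting Postgres again avoids having two servers write to the same data files.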


(Zach Alexander) #6

Not that I noticed.

I restored from a server backup, and we’re live again, but I’m still pretty freaked out because admin panel backups are still failing. When complete, they report 2.2 GB, but the download comes to only 1.1 GB, and the downloaded archive can’t be unzipped :<

Edit: the download problem turns out to be only because the transfer ends prematurely over the web. With scp it downloads fine.


(Sam Saffron) #7

On mobile, but I have recovered from this kind of thing before. The binary is inside the container; it’s just missing from the PATH.
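
For anyone landing here later, a minimal sketch of finding the binary is below. It assumes the Debian-style layout that the bootstrap log above shows for `initdb` and `postmaster` (`/usr/lib/postgresql/9.3/bin/`); the helper name `find_pg_binary` and the `PG_LIB_ROOT` override are made up for illustration.

```shell
# Print the full path of a PostgreSQL tool that is installed but not on PATH.
# Assumption: tools live under <root>/<version>/bin/, as in the log above.
find_pg_binary() {
  name="$1"
  root="${PG_LIB_ROOT:-/usr/lib/postgresql}"
  # If nothing matches, the glob stays literal and the -x test simply fails.
  for candidate in "$root"/*/bin/"$name"; do
    if [ -x "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Usage inside the container, as root. The -n flag makes pg_resetxlog only
# print the values it would use, without modifying anything:
#   bin=$(find_pg_binary pg_resetxlog) && sudo -u postgres "$bin" -n /shared/postgres_data
```

Running `pg_resetxlog` without `-n` discards write-ahead log data and can lose recent transactions, so it is probably worth copying `/shared/postgres_data` somewhere safe while Postgres is stopped before doing that.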


(Zach Alexander) #8

Thanks. The situation has stabilized somewhat – I was able to restore from a backup, which was also corrupted, but pg_resetxlog repaired it enough to rebuild the app again.

The database still seems corrupted though, as I can’t restore from a backup due to unique key constraints being violated. I’ll post more about that tomorrow in a separate thread. :confused: