Problem rebuilding because of slow database shutdown

This recommended upgrade failed, and the rebuild didn’t bring my forum back up. I’m running discourse-doctor now to try to fix it; if that fails too, I took a VM snapshot beforehand that I can fall back on.

Output:

2023-04-19 18:28:31.298 UTC [42] LOG:  received fast shutdown request
2023-04-19 18:28:33.651 UTC [65] LOG:  shutting down
2023-04-19 18:28:33.974 UTC [42] LOG:  database system is shut down


FAILED
--------------------
Pups::ExecError: su postgres -c 'psql discourse -c "alter schema public owner to discourse;"' failed with return #<Process::Status: pid 59 exit 2>
Location of failure: /usr/local/lib/ruby/gems/3.2.0/gems/pups-1.1.1/lib/pups/exec_command.rb:117:in `spawn'
exec failed with the params "su postgres -c 'psql $db_name -c \"alter schema public owner to $db_user;\"'"
bootstrap failed with exit code 2
** FAILED TO BOOTSTRAP ** please scroll up and look for earlier error messages, there may be more than one.
./discourse-doctor may help diagnose the problem.
c13e1ba313de8fc84f6e2fb0f88197a908803c39791283effb8c82f55b56b6dc
Command exited with non-zero status 1
1.85user 1.84system 3:21.56elapsed 1%CPU (0avgtext+0avgdata 36996maxresident)k
197608inputs+368outputs (1133major+96509minor)pagefaults 0swaps

Are you on the beta branch?

You can try to restart your container with

 ./launcher start app

but that’s what discourse-doctor should do.

You’ll need to give more of the output as the error is above what you included.

Yes we are on the beta branch. I always run inside nohup, so I have the full log.

Discourse-doctor is still grinding away, but it hasn’t failed yet so I have hope.

https://pastebin.mozilla.org/iw2zc5zd

Edit: Discourse-doctor got us back up and running.

I basically asked for this, upgrading an hour after that notification and being the first one to do so. No real stress with that snapshot beforehand, so I took one for the team here fellas.

  • 2023-04-19 18:28:26.755 UTC [45] LOG: database system was not properly shut down; automatic recovery in progress

If your database can’t shut down cleanly within 60s, which happens with large databases on slower disks, it ends up in this state. The rebuild then fails unless recovery completes within 5s, which is unlikely when the database is large or the disks are slow.

This has nothing to do with the changes listed here, and has been a problem in Discourse since at least 2016.

Ahh, thanks. Maybe it should wait longer for larger forums like ours. If you just kill the DB process, it will need to roll back transactions after it is started again, and that can take a very long time.

The terminology around beta is somewhat confusing. The admin dashboard says we’re running beta; is there somewhere else we should have looked? My understanding is that beta is recommended for Discourse, based on the release announcements discouraging use of the stable branch.

The default is actually tests-passed, which is considered production-ready.

How big is your database? Is it on ssd? How much ram do you have?

Having a separate data container would require fewer database restarts.

When was 60s decided on for a safe shutdown? How many installations are now much bigger than what was normal back then?

Ideally this 60s wait should be more of a closed-loop wait, with a limit. It sounds like the limit should be higher if many instances out there are now vulnerable.
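
To illustrate what I mean by a closed-loop wait with a limit, here is a rough Python sketch. It is only an illustration of the idea, not how the launcher or the Postgres shutdown is actually implemented; `wait_until`, `is_done`, `limit_s` and `poll_s` are made-up names.

```python
import time
from typing import Callable


def wait_until(is_done: Callable[[], bool], limit_s: float, poll_s: float = 1.0) -> bool:
    """Poll is_done() until it returns True or limit_s seconds have passed.

    Returns True as soon as the condition holds, so a fast shutdown does not
    pay for the full limit. Returns False if the limit is reached first.
    """
    deadline = time.monotonic() + limit_s
    while True:
        if is_done():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_s, remaining))
```

For a database shutdown, `is_done` would be whatever check reports that the server process has exited, and `limit_s` would be the upper bound (60 seconds originally, 600 after the change mentioned later in this thread). Because the loop returns as soon as the check passes, raising the limit costs nothing on installs that shut down quickly.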

It’s 105GB, on SSD, 16GB VM, and I gave postgres an 8GB buffer pool.

I think I saw that it was at least as long ago as 2016. But things have changed. EDIT: Here’s a new commit.

I don’t think that many on a standard install, as it’s been this way almost since the beginning.

Uh, yeah. That’s a big database. I suspect that few people have a database that big that’s not on RDS or at least in a separate container. You should probably consider switching to a 2-container install.

We’ll consider it. Is the switching method documented? And are there any other advantages that increasing the 60s timer wouldn’t provide?

I increased it to 10 minutes yesterday

Oh great, I assumed he was posting the original commit back in 2016. So any advantages for us at all?

You can check out Move from standalone container to separate web and data containers

You can build a new container while the old one continues to run. You don’t need to shut down the database to build a new container.

There is now a 10 minute window for shutting down postgres, which should solve your current problem. Once you do one more rebuild, you’ll have the 10 minutes instead of one.

Oh, that guy just built a completely new two-container instance and then restored from backup. We definitely aren’t doing that without a good reason. I just had to do it about 2 months ago to avoid the disk space requirements of the PG13 upgrade.

If you’re not on PG13 then you should fix that.

I’d spin up a new server and move to it.

We are now; that one was ultimately unavoidable! Beyond the DB, we also needed to upgrade from the no-longer-supported 18.04 LTS.

With a DB this size, you should move it to a dedicated container.

It will speed up rebuilds a ton and simplify everything for you.

If there’s documentation on how to do that without a complete rebuild from scratch followed by restoring backups, we’ll definitely consider it.

So you want to Migrate quickly to separate web and data containers
