Rebuild fails on db:migrate w/PG12

Running “./launcher rebuild app” fails on db:migrate. Note we are using PostgreSQL v12.

This murdered our forum. The docker container came back up, but the forum didn’t. Luckily I took a VM snapshot before upgrading, restoring it now.

Log:

Tasks: TOP => db:migrate
(See full trace by running task with --trace)
I, [2022-04-14T15:20:51.896917 #1]  INFO -- : == 20220304162250 EnableUnaccentExtension: migrating ==========================
-- enable_extension("unaccent")

I, [2022-04-14T15:20:51.897218 #1]  INFO -- : Terminating async processes
I, [2022-04-14T15:20:51.897265 #1]  INFO -- : Sending INT to HOME=/var/lib/postgresql USER=postgres exec chpst -u postgres:postgres:ssl-cert -U postgres:postgres:ssl-cert /usr/lib/postgresql/12/bin/postmaster -D /etc/postgresql/12/main pid: 1710
I, [2022-04-14T15:20:51.897396 #1]  INFO -- : Sending TERM to exec chpst -u redis -U redis /usr/bin/redis-server /etc/redis/redis.conf pid: 1827
2022-04-14 15:20:51.897 UTC [1710] LOG:  received fast shutdown request
1827:signal-handler (1649949651) Received SIGTERM scheduling shutdown...
2022-04-14 15:20:51.900 UTC [1710] LOG:  aborting any active transactions
2022-04-14 15:20:51.902 UTC [1710] LOG:  background worker "logical replication launcher" (PID 1719) exited with exit code 1
2022-04-14 15:20:51.904 UTC [1714] LOG:  shutting down
1827:M 14 Apr 2022 15:20:51.913 # User requested shutdown...
1827:M 14 Apr 2022 15:20:51.914 * Saving the final RDB snapshot before exiting.
2022-04-14 15:20:51.965 UTC [1710] LOG:  database system is shut down
1827:M 14 Apr 2022 15:20:53.157 * DB saved on disk
1827:M 14 Apr 2022 15:20:53.157 # Redis is now ready to exit, bye bye...


FAILED
--------------------
Pups::ExecError: cd /var/www/discourse && su discourse -c 'bundle exec rake db:migrate' failed with return #<Process::Status: pid 2118 exit 1>
Location of failure: /usr/local/lib/ruby/gems/2.7.0/gems/pups-1.1.1/lib/pups/exec_command.rb:117:in `spawn'
exec failed with the params {"cd"=>"$home", "hook"=>"db_migrate", "cmd"=>["su discourse -c 'bundle exec rake db:migrate'"]}
bootstrap failed with exit code 1
** FAILED TO BOOTSTRAP ** please scroll up and look for earlier error messages, there may be more than one.
./discourse-doctor may help diagnose the problem.
2dcd9aeca614c9e06ef748f673eb68203db6eae5c445253b416d666663879d6d
==================== END REBUILD LOG ====================
Failed to rebuild app.

They aren’t separate and no external PG. The PG13 upgrade also fails (non-destructively, unlike today’s upgrade) and I didn’t get support on how to fix it in here, to be honest.

I believe (can’t check obv) we only had one container showing up in docker ps. The standard install is 2 containers now?

This extension became available as a “trusted” extension on PostgreSQL 13+, where it can be enabled by any users.

Since you are running an older PostgreSQL version, you will have to workaround this, by installing and enabling this extension for the Discourse user, and potentially trying to trick Discourse in considering this extension as already installed. That or moving to the current supported version of PostgreSQL.

1 Like

OK. So in summary, you are no longer supporting PG12. Suggest posting that to the PG13 upgrade thread somewhere, and probably in the 2.9.0b4 annoucements too.

1 Like

If you’re worried about down time then one solution would be to is replicate the server to the new host. Something like How to set up PostgreSQL for high availability and replication with Hot Standby  |  Google Cloud Platform Community. (You might be able to find more recent instructions or some that make more sense to you…). That would let you migrate to the new database while the old one continued to function. But that’s far from simple.

That’s a neat idea, if it was mysql, ms-sql, or oracle I’d do that, but given my lack of experience with postgres I’d probably just eat the downtime.

1 Like

So my restore finally finished, after over 4 hours. And Discourse was shooting 502 errors. This snapshot was taken before the upgrade, so this is super bizarre.

Anyway looking at the nginx logs I found this error

2022/04/14 19:36:21 [error] 493#493: *350 connect() failed (111: Connection refused) while connecting to upstream, client: 216.228.112.21, server: _, request: "POST /message-bus/15f7a893581d489e930634c8f3ed1134/poll?dlp=t HTTP/2.0", upstream: "http://127.0.0.1:3000/message-bus/15f7a893581d489e930634c8f3ed1134/poll?dlp=t", host: "forum.quartertothree.com", referrer: "https://forum.quartertothree.com/c/movies/8"

and then in the ruby logs,

/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/bootsnap-1.10.3/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:30:in `require': cannot load such file -- /var/www/discourse/lib/freedom_patches/schema_cache_concurrency.rb (LoadError)

Sure enough, this file was owned by root:root and chmod 0000. Changing it to discourse:root and 644 to match the other files in that directory got us back up and running. Phew!

Any idea how that file got nuked/changed? It’s 0 bytes too, super weird.


root@forum-app:/shared/log/rails# ls -la /var/www/discourse/lib/freedom_patches/schema_cache_concurrency.rb
-rw-r--r-- 1 discourse root 0 Feb 10 17:41 /var/www/discourse/lib/freedom_patches/schema_cache_concurrency.rb
1 Like

This :point_up:

Now we are facing the whole “not enough space because the upgrade requires 300gb” all over again.

Good luck to you. As far as I can tell there is no real solution other than restoring a backup to a fresh host.

Thanks, but basically dead.

Tried to do the “Restore to a Fresh Install”, but it doesn’t work, PG doesn’t cooperate. Tried to downgrade the discourse version and go back to PG12, but then it becomes a carnival of light with all the plugins.

Did you try to rebuild with this postgres template instead of the default one?

discourse_docker/postgres.12.template.yml at main · discourse/discourse_docker · GitHub

1 Like

First of all: Thanks for helping.

Yes, I did change the template, as part of the “ok this is not working” strategy so that I could go back to PG12 (even though this makes me question how am I going to upgrade PG then :thinking: ).

I had to “hunt” for a specific commit but apparently this one is a safe bet: Version bump to v2.8.0.beta10 (#15382) · discourse/discourse@07c0104 · GitHub

I tried more recent ones but the whole enable_extension("unaccent") error is still present, which implies that on those commits the change to depend on it was already done.

Waiting on the result of this try.

Update: Nope, failed on the restore during the “unpacking dump” phase and now is dead again.

Hi, If you have a backup of discourse, I’d suggest doing this on a different server to try it out first.

I believe that you experienced this issue while upgrading an instance of discourse that was running an older version.

So, Try installing a copy of discourse by manually editing the yml to use discourse “stable” and pin postgres version to 12.

If the build succeeds, try restoring the backup. Hopefully, it will restore successfully.
If it succeeds, change postgres 12 template back to the default postgres template and comment out the stable tag so that discourse rebuilds with latest tests-passed.

I believe that if the backup is salvageable, it should then be able to survive the postgres and discourse upgrades.

Let me know if you encounter any issues.

2 Likes

I’m pretty much in “limbo” right now. Tried your suggestion with PG12 and “Stable”. It doesn’t work, the restore just stops. So I’m wiping the machine once again to try again fresh because now it won’t do app rebuild.

While that machine is trying to “go back to PG12” I’m trying to see if I can move forward on the other: If I try to upgrade PG with a fresh install it dies after Creating missing functions in the discourse_functions schema... (starts returning 500) and tail -f shared/data/log/var-log/postgres/current shows that the data container is “moving”, although just full of “errors” such as this:

discourse@discourse ERROR:  relation "user_auth_tokens" does not exist at character 34
discourse@discourse STATEMENT:  SELECT "user_auth_tokens".* FROM "user_auth_tokens" WHERE ((auth_token = 'XXXX=' OR
                                  prev_auth_token = 'XXXX=') AND rotated_at > '2022-03-09 10:21:44.051357') LIMIT 1
discourse@discourse ERROR:  relation "application_requests" does not exist at character 41
discourse@discourse STATEMENT:  SELECT "application_requests"."id" FROM "application_requests" WHERE "application_requests"."date" = '2022-05-08' AND "application_requests"."req_type" = 0 LIMIT 1

However, Discourse may be dead but the machine Is being used, so… is it working but it just takes hours and hours? Because I let it roll for 1h+ and nothing changed.

At this point I’m already thinking if this shouldn’t go here, lol.

PD: I tried 2 different backups, since I took 2 before the adventure.

I don’t understand what do you mean by just stops. Are you presented with some error? Do you experience frozen screen? Something else?

1 Like

That’s not good-- so you couldn’t restore your backup from the old version of Discourse with PG12 to a fresh new one with PG13? Can you post the error messages etc? Everybody here repeatedly assured me that would work.

1 Like

In what sense? Is there an error?