Postgres problem... again!

Hi, I have two big problems when stopping and restarting Discourse.

After I do it two or three times (no matter whether with ./launcher stop or via Portainer's stop), the container refuses to start again, continuously stuck on “rm: cannot remove ‘/shared/nginx.http.sock’: Is a directory”. (I’ve just noticed the socket is not in the “shared” dir but in “shared/standalone”; is that just a mistake in the message?)
For some reason still unknown to me, this socket becomes a directory, maybe after a stop, and the template can’t delete it because it expects to delete a socket, not a directory. Deleting it manually changes nothing; it reappears every time.

“./launcher rebuild app” gets stuck on “FATAL: the database system is starting up”, after an initial “PANIC could not locate a valid checkpoint record”. I’ve read and tried everything I found regarding this problem, and the only working “solution” I found was deleting the discourse dir with everything inside and setting everything up again… obviously not a real solution!

It seems that stopping the Discourse container sometimes leaves the database in a bad state, and it can’t continue because it “is starting up”, maybe trying to fix something. But I still haven’t found how to fix this problem, which seems to come up after a few stop-and-start cycles.

Any clue? Is there some way to let Postgres fix its own problems?

No, it’s not a mistake. This file is in the volume that gets mounted in the container, so it has a different path depending on the point of view, i.e. inside the container vs. outside the container.
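To make the two points of view concrete, here is a minimal sketch of the path translation. It assumes the host directory shared/standalone is what gets mounted at /shared inside the container, which is what the paths quoted in this thread suggest; the function name is made up for illustration.

```python
from pathlib import PurePosixPath

# Assumption: host directory shared/standalone is mounted at /shared
# inside the container, as the paths quoted in this thread suggest.
CONTAINER_MOUNT = PurePosixPath("/shared")
HOST_MOUNT = PurePosixPath("shared/standalone")


def container_to_host(path: str) -> str:
    """Translate a path seen inside the container to the host-side path.

    Raises ValueError if the path is not under the mounted volume.
    """
    relative = PurePosixPath(path).relative_to(CONTAINER_MOUNT)
    return str(HOST_MOUNT / relative)


print(container_to_host("/shared/nginx.http.sock"))
```

So the “/shared/nginx.http.sock” in the error message and the “shared/standalone/nginx.http.sock” seen on the host would be the same file under that assumption.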

This happens when the container is not stopped properly. PostgreSQL needs some time to shut down properly, and we allow for that when you stop with our launcher command. However, if your instance is big enough, or there are too many running transactions, the database may struggle to stop properly before the deadline it receives.
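If the container is stopped by something other than the launcher, one way to give the database more time is to raise the grace period that docker stop waits before force-killing the container. The sketch below only illustrates that idea: the container name app is taken from the rebuild command quoted earlier in the thread, and the 600 second value is an arbitrary assumption, not a recommended setting.

```python
import subprocess
from typing import List


def build_stop_command(container: str, grace_seconds: int) -> List[str]:
    """Build a `docker stop` call with an explicit grace period.

    `docker stop -t N` sends the stop signal and only force-kills the
    container if it is still running after N seconds.
    """
    if grace_seconds < 0:
        raise ValueError("grace_seconds must not be negative")
    return ["docker", "stop", "-t", str(grace_seconds), container]


def stop_container(container: str, grace_seconds: int) -> int:
    """Run the stop command and return docker's exit code."""
    return subprocess.run(build_stop_command(container, grace_seconds)).returncode


if __name__ == "__main__":
    # Assumed values: container "app", 600 seconds of grace time.
    print(" ".join(build_stop_command("app", 600)))
```

Tools such as Portainer issue their own stop requests, so whether a longer grace period applies there depends on how that tool is configured.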

But I find this “nginx.http.sock” created in “shared/standalone” on the host file system, too. Initially it is shown in pink, as a socket, but after a while, maybe after a stop, it turns blue, as a directory, and the container refuses to start, stuck trying to delete a socket that has become a directory.
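Rather than relying on the colors in the directory listing, the file type can be checked directly. This is a small diagnostic sketch, not part of Discourse; the helper names are made up, and it only reports (and optionally clears) what is at the path. It does not explain why a directory gets created there in the first place.

```python
import os
import stat


def describe_path(path: str) -> str:
    """Return 'socket', 'directory', 'other' or 'missing' for the given path."""
    try:
        mode = os.lstat(path).st_mode  # lstat: do not follow symlinks
    except FileNotFoundError:
        return "missing"
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISDIR(mode):
        return "directory"
    return "other"


def clear_stale_socket_path(path: str) -> str:
    """Remove a leftover socket or empty directory; return what was found."""
    kind = describe_path(path)
    if kind == "socket":
        os.unlink(path)
    elif kind == "directory":
        os.rmdir(path)  # fails on purpose if the directory is not empty
    return kind
```

For example, describe_path("shared/standalone/nginx.http.sock") run on the host while the container is stopped would show whether the entry is currently a socket or a directory.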

So, what can we do to fix it? For now I’m just experimenting, but if I go online with a few hundred people connected and hundreds of thousands of posts, will I risk losing everything just because Postgres can’t handle an abrupt interruption? There has to be something one can do to fix a damaged database. Can Discourse run on an external Postgres, something we can manage even if the Discourse container doesn’t start? In short, in case of “PANIC” or “FATAL”, there should be some fix…

I’m currently trying to solve the problem (for the future) by setting up a “containerized” Postgres and attaching it to Discourse, in the hope of not damaging the db when stopping Discourse (or at least of being able to do some maintenance even with Discourse down).