rake posts:rebake crashes with errors: PG::ConnectionBad: PQsocket

I migrated a 200,000-post forum to a new server. The live site was put in read-only mode so there wouldn’t be downtime.

I restored the backup on a different subdomain so that users wouldn’t see the install screens or any problems that might happen during the restore – something like dev.example.com.

As soon as the restore was complete, I pointed the DNS at the new server and changed the app.yml file’s domain to the normal forum.example.com.
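For the record, that was just editing containers/app.yml so the hostname line reads DISCOURSE_HOSTNAME: forum.example.com, then rebuilding, something like:

    cd /var/discourse
    ./launcher rebuild app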

Then all the smilies in the baked posts were pointing at the dev.example.com server, so I ran rake posts:rebake.
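In case it matters, I ran it from inside the container, roughly like this:

    cd /var/discourse
    ./launcher enter app
    rake posts:rebake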

It gets through about 1,000-2,000 posts before crashing with errors about the database connection.

Here are a couple of excerpts:

/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.4/lib/bundler/vendor/thor/lib/thor.rb:392:in `dispatch'
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.4/lib/bundler/cli.rb:34:in `dispatch'
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.4/lib/bundler/vendor/thor/lib/thor/base.rb:485:in `start'
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.4/lib/bundler/cli.rb:28:in `start'
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.4/exe/bundle:45:in `block in <top (required)>'
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.4/lib/bundler/friendly_errors.rb:117:in `with_friendly_errors'
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.4/exe/bundle:33:in `<top (required)>'
/usr/local/bin/bundle:25:in `load'
/usr/local/bin/bundle:25:in `<main>'
     1999 / 200968 (  1.0%)
Failed to rebake (topic_id: 78730, post_id: 210607)
PG::ConnectionBad: PQsocket() can't get socket descriptor
/var/www/discourse/lib/tasks/posts.rake:108:in `rebake_posts_all_sites'
/var/www/discourse/lib/tasks/posts.rake:7:in `block in <main>'
/usr/local/bin/bundle:25:in `load'
/usr/local/bin/bundle:25:in `<main>'

Caused by:
PG::ConnectionBad: PQsocket() can't get socket descriptor

At the moment, I have the images loading by redirecting the dev.example.com domain to the forum.example.com domain, but it’s just a temporary solution.

Does anyone know how to get past that error so I can rebake all the posts? Is it creating too much load on the database?

1 Like

First, see Change the domain name or rename your Discourse (though another solution is to back up and then restore with the new hostname).

My guess is that you’re running out of connections to the database, but that isn’t quite the error I would expect.

Is this a standard install or are you using some other PG server?
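If it is a standard install, you could get a rough idea of connection usage from inside the container, something like this (from memory, so the exact incantation may differ):

    ./launcher enter app
    su postgres -c "psql -c 'SELECT count(*) FROM pg_stat_activity;'"
    su postgres -c "psql -c 'SHOW max_connections;'"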

1 Like

Thanks for the links. It’s a standard Docker install on a DigitalOcean droplet (“Premium AMD”, 4GB RAM, 2 vCPUs).

I followed the instructions in the link you mentioned. I found some settings still pointing at the wrong URL, so I fixed those and rebuilt the forum again to be safe.

Then I ran this kind of command:

discourse remap dev.example.com forum.example.com

That command crashed with this kind of error:

Error: ERROR:  duplicate key value violates unique constraint "unique_post_links"
DETAIL:  Key (topic_id, post_id, url)=(78821, 207117, https://forum.example.com/t/the-slug/78946/7) already exists.

So I temporarily deleted a post that linked to the mentioned URL (https://forum.example.com/t/the-slug/78946/7), ran the command again, and it worked without crashing.
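In hindsight, I think something like this in the rails console would have listed the colliding links up front instead of finding them one crash at a time (untested, and TopicLink is just my guess at the model behind that constraint):

    TopicLink.where("url LIKE ?", "%dev.example.com%").find_each do |link|
      new_url = link.url.sub("dev.example.com", "forum.example.com")
      if TopicLink.exists?(topic_id: link.topic_id, post_id: link.post_id, url: new_url)
        puts "collision: topic #{link.topic_id}, post #{link.post_id}, #{new_url}"
      end
    end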

Then I did rake posts:rebake again.

It failed on a few posts like this, but kept going (I rebuilt the HTML for those posts manually):

Rebaking post markdown for 'default'
     2273 / 200996 (  1.1%)
Failed to rebake (topic_id: 66586, post_id: 210353)
JavaScript was terminated (either by timeout or explicitly)
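By “manually” I just mean calling rebake on those post ids from the rails console, roughly (going from memory on the method name):

    # inside the container: ./launcher enter app, then: rails c
    Post.find(210353).rebake!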

Finally it crashed just before it reached 11,000 posts with errors like this:

/usr/local/bin/bundle:25:in `<main>'
    10996 / 200996 (  5.5%)
Failed to rebake (topic_id: 76678, post_id: 200988)
PG::ConnectionBad: PQsocket() can't get socket descriptor
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/pg.rb:69:in `exec_params'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/pg.rb:69:in `exec_params'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/activerecord-7.0.4.1/lib/active_record/connection_adapters/postgresql_adapter.rb:768:in `block (2 levels) in exec_no_cache'

The entire server also seemed to go offline; Uptime Robot alerted me that the site was down.

Do you think the server isn’t powerful enough to run that command? :thinking:

It’s running at over 80% RAM normally, and peaks at 100% while the command is running. Maybe it just ran out of memory.

If you have local disk, you can add swap, which would avoid memory exhaustion (whether or not that’s the cause of trouble here). What does free tell you? Do you see oom or memory in the output of dmesg?
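Concretely, on the host, something like:

    free -h
    dmesg | grep -iE 'oom|memory'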

1 Like

At the moment it says:

               total        used        free      shared  buff/cache   available
Mem:           3.8Gi       2.1Gi       160Mi       1.0Gi       1.6Gi       488Mi
Swap:             0B          0B          0B

I don’t see oom, but the word memory appears in a few places about reserving and freeing memory.

The server was created with 4GB RAM, so Discourse didn’t automatically create a swapfile. Do you think it’s worth adding?

If you have the disk space, it’s certainly worth adding, say, 2G of swap.
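If it helps, creating one is roughly this on the host (/swapfile is just the conventional path; fallocate works on ext4, use dd if your filesystem doesn’t support it):

    sudo fallocate -l 2G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab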

The other thing to do is to monitor usage while your large job is running. I would use vmstat 5 5 and perhaps log to a file. You’re hoping not to see large values in si or so columns, and not to see the swpd column get close to the size of your swap.
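For logging it, something along these lines works (drop the second 5 if you want it to keep sampling until you stop it):

    vmstat 5 5 | tee -a vmstat.log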

Perhaps see this post:

(It seems possible that the database system is running out of some resource, but I don’t know anything about that.)

1 Like

Thanks, I’ll try those things later today. I have 50GB free at the moment.

I added a 2GB swapfile, and that seems to have fixed it. The rebaking is only 20% done, but there hasn’t been a single error yet, and RAM usage is just under 100%.

Thank you both for your help.

2 Likes

Sounds good! Just for the record:

  • you could add more swap, even as the task is running, if vmstat or free (or top) show that swap is getting exhausted.
  • if you’re careful, you could probably do a reversible temporary upgrade to a larger-RAM instance, which will cost a little money, but need only be in place for a few hours. It’s important not to move to a larger-disk instance as that is not reversible. (More RAM should allow things to run at full speed, whereas modest RAM and lots of swap might have a performance penalty, and the task will take longer to finish.)
2 Likes

I thought about it, but I would have to shut the server off, and users already had an annoying “read-only” period and downtime when I moved servers. :sweat:

I wasn’t able to finish it last night because I had to go to sleep, but it’s running again now. 30% so far without any errors.

1 Like

Do keep an eye on things with vmstat or similar - that’s such a long-running job, you don’t want to have to restart it. I’d probably add another 2G of swap, to be safe.

1 Like

Thanks, I checked with vmstat occasionally. I started it in a tmux session so I could detach and close my laptop for a while. It probably took 8-9 hours to run the command, but everything completed without errors.
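The tmux part was nothing special, just something like:

    tmux new -s rebake      # start a named session and run the rake task inside it
    # Ctrl-b then d to detach
    tmux attach -t rebake   # reattach later to check progress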

2 Likes

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.