PostgreSQL 15 update

What’s the output of:

tail /var/discourse/shared/standalone/log/var-log/postgres/current
root@ubuntu-s-1vcpu-1gb-nyc3-01:/var/discourse# tail /var/discourse/shared/standalone/log/var-log/postgres/current
	                INNER JOIN (SELECT topic_id, GREATEST(COUNT(*), 1) AS count
	             FROM posts
	             WHERE created_at >= '2025-02-01 04:44:07.521324'
	               AND deleted_at IS NULL
	               AND NOT hidden
	               AND post_type = 1
	               AND user_id <> -1
	             GROUP BY topic_id) c ON tt.topic_id = c.topic_id
	                WHERE tt.topic_id = top_topics.topic_id
	                  AND tt.daily_posts_count <> c.count
root@ubuntu-s-1vcpu-1gb-nyc3-01:/var/discourse# vim shared/standalone/log/var-log/postgres/current
root@ubuntu-s-1vcpu-1gb-nyc3-01:/var/discourse#
root@ubuntu-s-1vcpu-1gb-nyc3-01:/var/discourse#
root@ubuntu-s-1vcpu-1gb-nyc3-01:/var/discourse# cat shared/standalone/log/var-log/postgres/current
2025-02-02 04:42:42.765 UTC [542] LOG:  starting PostgreSQL 13.18 (Debian 13.18-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2025-02-02 04:42:42.768 UTC [542] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2025-02-02 04:42:42.768 UTC [542] LOG:  listening on IPv6 address "::", port 5432
2025-02-02 04:42:42.779 UTC [542] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2025-02-02 04:42:42.804 UTC [561] LOG:  database system was interrupted; last known up at 2025-02-02 04:37:10 UTC
2025-02-02 04:42:43.209 UTC [561] LOG:  database system was not properly shut down; automatic recovery in progress
2025-02-02 04:42:43.221 UTC [561] LOG:  redo starts at 2/DB0B59D0
2025-02-02 04:42:43.264 UTC [561] LOG:  invalid record length at 2/DB22D540: wanted 24, got 0
2025-02-02 04:42:43.264 UTC [561] LOG:  redo done at 2/DB22D518
2025-02-02 04:42:43.347 UTC [542] LOG:  database system is ready to accept connections
2025-02-02 04:43:49.036 UTC [1273] discourse@discourse LOG:  duration: 584.511 ms  statement: UPDATE topic_hot_scores thsOrig
	SET
	    recent_likes = COALESCE(new_values.likes_count, 0),
	    recent_posters = COALESCE(new_values.unique_participants, 0),
	    recent_first_bumped_at = COALESCE(new_values.first_bumped_at, ths.recent_first_bumped_at)
	FROM
	  topic_hot_scores ths
	  LEFT OUTER JOIN
	  (
	    SELECT
	        t.id AS topic_id,
	        COUNT(DISTINCT p.user_id) AS unique_participants,
	        (
	          SELECT COUNT(distinct pa.user_id)
	          FROM post_actions pa
	          JOIN posts p2 ON p2.id = pa.post_id
	          WHERE p2.topic_id = t.id
	            AND p2.post_type = 1
	            AND p2.deleted_at IS NULL
	            AND p2.user_deleted = false
	            AND pa.post_action_type_id = 2 -- action_type for 'like'
	            AND pa.created_at >= '2025-01-26 04:43:48.403603'
	            AND pa.deleted_at IS NULL
	        ) AS likes_count,
	        MIN(p.created_at) AS first_bumped_at
	    FROM
	        topics t
	    JOIN
	        posts p ON t.id = p.topic_id
	    WHERE
	        p.created_at >= '2025-01-26 04:43:48.403603'
	        AND t.archetype <> 'private_message'
	        AND t.deleted_at IS NULL
	        AND p.deleted_at IS NULL
	        AND p.user_deleted = false
	        AND t.created_at <= '2025-02-02 04:43:48.403603'
	        AND t.bumped_at >= '2025-01-26 04:43:48.403603'
	        AND p.created_at < '2025-02-02 04:43:48.403603'
	        AND p.created_at >= '2025-01-26 04:43:48.403603'
	        AND p.post_type = 1
	    GROUP BY
	        t.id
	  ) AS new_values
	ON ths.topic_id = new_values.topic_id

	WHERE thsOrig.topic_id = ths.topic_id

2025-02-02 04:44:04.629 UTC [1273] discourse@discourse LOG:  duration: 153.751 ms  statement: WITH x AS (SELECT
	                    u.id user_id,
	                    SUM(CASE WHEN p.id IS NOT NULL AND t.id IS NOT NULL AND ua.action_type = 2 THEN 1 ELSE 0 END) likes_received,
	                    SUM(CASE WHEN p.id IS NOT NULL AND t.id IS NOT NULL AND ua.action_type = 1 THEN 1 ELSE 0 END) likes_given,
	                    COALESCE((SELECT COUNT(topic_id) FROM topic_views AS v WHERE v.user_id = u.id AND v.viewed_at > '2025-02-01 04:44:04.456893'), 0) topics_entered,
	                    COALESCE((SELECT COUNT(id) FROM user_visits AS uv WHERE uv.user_id = u.id AND uv.visited_at > '2025-02-01 04:44:04.456893'), 0) days_visited,
	                    COALESCE((SELECT SUM(posts_read) FROM user_visits AS uv2 WHERE uv2.user_id = u.id AND uv2.visited_at > '2025-02-01 04:44:04.456893'), 0) posts_read,
	                    SUM(CASE WHEN t2.id IS NOT NULL AND ua.action_type = 4 THEN 1 ELSE 0 END) topic_count,
	                    SUM(CASE WHEN p.id IS NOT NULL AND t.id IS NOT NULL AND ua.action_type = 5 THEN 1 ELSE 0 END) post_count
	                  FROM users AS u
	                  LEFT OUTER JOIN user_actions AS ua ON ua.user_id = u.id AND COALESCE(ua.created_at, '2025-02-01 04:44:04.456893') > '2025-02-01 04:44:04.456893'
	                  LEFT OUTER JOIN posts AS p ON ua.target_post_id = p.id AND p.deleted_at IS NULL AND p.post_type = 1 AND NOT p.hidden
	                  LEFT OUTER JOIN topics AS t ON p.topic_id = t.id AND t.archetype = 'regular' AND t.deleted_at IS NULL AND t.visible
	                  LEFT OUTER JOIN topics AS t2 ON t2.id = ua.target_topic_id AND t2.archetype = 'regular' AND t2.deleted_at IS NULL AND t2.visible
	                  LEFT OUTER JOIN categories AS c ON t.category_id = c.id
	                  WHERE u.active
	                    AND u.silenced_till IS NULL
	                    AND u.id > 0
	                  GROUP BY u.id)
	      UPDATE directory_items di SET
	               likes_received = x.likes_received,
	               likes_given = x.likes_given,
	               topics_entered = x.topics_entered,
	               days_visited = x.days_visited,
	               posts_read = x.posts_read,
	               topic_count = x.topic_count,
	               post_count = x.post_count
	      FROM x
	      WHERE
	        x.user_id = di.user_id AND
	        di.period_type = 5 AND (
	        di.likes_received <> x.likes_received OR
	        di.likes_given <> x.likes_given OR
	        di.topics_entered <> x.topics_entered OR
	        di.days_visited <> x.days_visited OR
	        di.posts_read <> x.posts_read OR
	        di.topic_count <> x.topic_count OR
	        di.post_count <> x.post_count )


2025-02-02 04:44:07.657 UTC [1400] discourse@discourse LOG:  duration: 135.675 ms  statement: UPDATE top_topics
	                SET daily_posts_count = c.count
	                FROM top_topics tt
	                INNER JOIN (SELECT topic_id, GREATEST(COUNT(*), 1) AS count
	             FROM posts
	             WHERE created_at >= '2025-02-01 04:44:07.521324'
	               AND deleted_at IS NULL
	               AND NOT hidden
	               AND post_type = 1
	               AND user_id <> -1
	             GROUP BY topic_id) c ON tt.topic_id = c.topic_id
	                WHERE tt.topic_id = top_topics.topic_id
	                  AND tt.daily_posts_count <> c.count

If that is the output you see after stopping your app container, it might mean that something is still connecting to the database. Check pg_stat_activity to see if you can isolate it. You can try stopping these services in the app container to help narrow it down.

./launcher start app
./launcher enter app
sv stop nginx
sv stop unicorn
1 Like

I just checked and I only see

2025-02-02 12:06:05.808 UTC [48] LOG:  received smart shutdown request

So I guess the database did not shutdown correctly and was forcibly shutdown after timeout?

It would seem so. Could you try to follow the steps in this previous post and see if there are any other client connections to the database? Ideally, you should stop any other applications that connect to the database before stopping the app container.

Alternatively, you can try to terminate all established database sessions and quickly stop the postgres service (ideally before the client apps reconnect) then attempt another rebuild after confirming a clean database shutdown from the logs. However, I strongly recommend that you first identify the impact on your client applications listed in pg_stat_activity before terminating their connections.

Here’s a sample command you can run to terminate client connections and stop postgres from inside the app container after stopping nginx and unicorn first.

sudo -u postgres psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE pid <> pg_backend_pid();" && sv stop postgres
3 Likes

Just a quick update: we’ve applied a patch that should return a more appropriate error message if the database did not shut down cleanly.

4 Likes

So first, thanks a lot! Manually stopping the services inside the container apparently did the trick, and I was able to the double container rebuild for the upgrade (with locale variables adjusted as discussed above - side note: I checked the installation docs and I don’t think they mention locales; maybe this would be a good addition).

Unfortunately, the analysis of problematic processes was inconclusive. The only thing I see is Postgres-related things like WalWriter, AutoVacuum, etc. The only clue I have is that when I reboot the system and also after index changes on updates, I typically see a high CPU load for postgres for about half an hour.
And when I checked pg_stat_activity today after the reboot, I saw two long running queries (at least if my understanding of the columns is correct):

SELECT "posts"."id" FROM "posts" INNER JOIN "topics" ON "topics"."deleted_at" IS NULL AND "topics"."id" = "posts"."topic_id" LEFT JOIN post_search_data ON post_id = posts.id WHERE "posts"."deleted_at" IS NULL AND (posts.raw != '') AND (topics.deleted_at IS NULL) AND (post_search_data.locale IS NULL OR post_search_data.locale != 'de' OR post_search_data.version != 5) ORDER BY posts.id DESC LIMIT 20000

and

SELECT "optimized_images".* FROM "optimized_images" WHERE "optimized_images"."upload_id" = 13 AND "optimized_images"."height" = 32 AND "optimized_images"."width" = 32 LIMIT 1

After the aforementioned 30 minutes, the first one was apparently completed and the CPU load from postgres is now normal.
I’m not sure why these two queries take so long and even less why the second one is still showing up in pg_stat_activity after more than an hour, but potentially such long-running queries blocked the postgres service from shutting down correctly when trying to rebuild it earlier.

If you have any lead here, that’ll be much appreciated. But potentially this is also something that might be go into another thread (if so, just let me know, and I’ll edit this post).

Related screenshots:

2 Likes

Glad to hear that worked! I’ve now included those steps in OP.

On the high postgres CPU load on startup, it sounds like you’ve been experiencing the same issue even before the PG version update. If so, that’s unrelated to the update. (Regardless, it’s recommended to run a vacuum after a PG version update. See the post-update tasks.)

Poor database/query performance could be caused by a myriad of issues. It could be corrupt indexes in the database, an under-provisioned host, a bug in Discourse which introduced a problematic query etc. It might help to open a Support topic for that.

4 Likes

6 posts were split to a new topic: Site offline after rebuild (4th Feb 2025)

(post deleted by author)