Unusually high CPU usage

One last tidbit of information, then I think I’ll be out for a few hours.

root@discourse_app:/# ps aux --sort=-%mem | head -20
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres    4398 17.0 22.0 8216352 6797764 ?     Ss   11:14   9:53 postgres: 15/main: discourse discourse [local] UPDATE
postgres    2729 15.3 21.1 8123888 6517308 ?     Ss   11:13   9:03 postgres: 15/main: discourse discourse [local] UPDATE
postgres    2501 15.9 19.9 8079700 6148812 ?     Ds   11:13   9:25 postgres: 15/main: discourse discourse [local] UPDATE
postgres   22777 16.9 19.5 8084888 6012052 ?     Ds   11:42   4:58 postgres: 15/main: discourse discourse [local] UPDATE
postgres    2753 28.5 11.3 8055000 3482260 ?     Ss   11:13  16:50 postgres: 15/main: discourse discourse [local] idle
postgres   25715  2.9  6.9 7884064 2135536 ?     Ss   11:47   0:44 postgres: 15/main: discourse discourse [local] idle
postgres   20487  2.9  6.6 7885300 2061088 ?     Ss   11:39   0:59 postgres: 15/main: discourse discourse [local] idle
postgres   22055  3.3  6.5 7887336 2012504 ?     Ss   11:41   1:02 postgres: 15/main: discourse discourse [local] idle
postgres   25883  2.5  6.0 7884096 1848424 ?     Ss   11:47   0:38 postgres: 15/main: discourse discourse [local] idle
postgres   28126  2.4  5.6 7883848 1744912 ?     Ss   11:50   0:31 postgres: 15/main: discourse discourse [local] idle
postgres   29365  1.0  4.5 7883084 1386544 ?     Ss   11:52   0:12 postgres: 15/main: discourse discourse [local] idle
postgres   27172  1.6  4.4 7884288 1384664 ?     Ss   11:49   0:22 postgres: 15/main: discourse discourse [local] idle
postgres   25896  2.1  4.4 8034236 1357264 ?     Ss   11:47   0:31 postgres: 15/main: discourse discourse [local] idle
postgres      89  1.7  4.3 7864156 1342760 ?     Ss   11:11   1:04 postgres: 15/main: checkpointer
postgres   28505  1.0  4.2 7884360 1315360 ?     Ss   11:51   0:13 postgres: 15/main: discourse discourse [local] idle
postgres   27175  1.6  4.1 7882780 1277612 ?     Ss   11:49   0:23 postgres: 15/main: discourse discourse [local] idle
postgres   28553  0.9  3.4 7883976 1064964 ?     Ss   11:51   0:11 postgres: 15/main: discourse discourse [local] idle
postgres   30409  1.0  3.3 7882892 1034860 ?     Ss   11:54   0:10 postgres: 15/main: discourse discourse [local] idle
postgres   40651  4.6  1.9 7872036 592152 ?      Ss   12:11   0:03 postgres: 15/main: discourse discourse [local] idle
root@discourse_app:/# redis-cli info memory
# Memory
used_memory:179899224
used_memory_human:171.57M
used_memory_rss:47591424
used_memory_rss_human:45.39M
used_memory_peak:184509776
used_memory_peak_human:175.96M
used_memory_peak_perc:97.50%
used_memory_overhead:3681093
used_memory_startup:948600
used_memory_dataset:176218131
used_memory_dataset_perc:98.47%
allocator_allocated:181437808
allocator_active:182353920
allocator_resident:188317696
allocator_muzzy:0
total_system_memory:31537295360
total_system_memory_human:29.37G
used_memory_lua:58368
used_memory_vm_eval:58368
used_memory_lua_human:57.00K
used_memory_scripts_eval:10304
number_of_cached_scripts:13
number_of_functions:0
number_of_libraries:0
used_memory_vm_functions:33792
used_memory_vm_total:92160
used_memory_vm_total_human:90.00K
used_memory_functions:192
used_memory_scripts:10496
used_memory_scripts_human:10.25K
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.00
allocator_frag_bytes:700208
allocator_rss_ratio:1.03
allocator_rss_bytes:5963776
rss_overhead_ratio:0.25
rss_overhead_bytes:-140726272
mem_fragmentation_ratio:0.26
mem_fragmentation_bytes:-132268896
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_total_replication_buffers:0
mem_clients_slaves:0
mem_clients_normal:498197
mem_cluster_links:0
mem_aof_buffer:0
mem_allocator:jemalloc-5.3.0
mem_overhead_db_hashtable_rehashing:0
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:0
root@discourse_app:/# cat /etc/postgresql/15/main/postgresql.conf | grep shared_buffers
shared_buffers = 7424MB
#wal_buffers = -1                       # min 32kB, -1 sets based on shared_buffers
root@discourse_app:/# su - postgres -c "psql discourse -c \"SELECT pid, query_start, state, wait_event_type, wait_event, left(query, 100) as query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start;\""
  pid  |          query_start          | state  | wait_event_type |  wait_event  |                                                query
-------+-------------------------------+--------+-----------------+--------------+------------------------------------------------------------------------------------------------------
  2501 | 2026-02-07 11:25:01.028892+00 | active | IO              | DataFileRead | UPDATE posts                                                                                        +
       |                               |        |                 |              | SET percent_rank = X.percent_rank                                                                   +
       |                               |        |                 |              | FROM (                                                                                              +
       |                               |        |                 |              |   SELECT posts.id, Y.percent_rank                                                                   +
       |                               |        |                 |              |   FROM posts
  4398 | 2026-02-07 11:52:53.108942+00 | active | IPC             | BufferIO     | WITH eligible_users AS (                                                                            +
       |                               |        |                 |              |   SELECT id                                                                                         +
       |                               |        |                 |              |   FROM users                                                                                        +
       |                               |        |                 |              |   WHERE id > 0 AND active AND silenced_till IS NUL
  2729 | 2026-02-07 11:54:27.666129+00 | active | IPC             | BufferIO     | UPDATE topics AS topics                                                                             +
       |                               |        |                 |              | SET has_summary = (topics.like_count >= 1 AND                                                       +
       |                               |        |                 |              |                    topics.post
 22777 | 2026-02-07 11:59:27.040575+00 | active | IO              | DataFileRead | UPDATE posts                                                                                        +
       |                               |        |                 |              | SET percent_rank = X.percent_rank                                                                   +
       |                               |        |                 |              | FROM (                                                                                              +
       |                               |        |                 |              |   SELECT posts.id, Y.percent_rank                                                                   +
       |                               |        |                 |              |   FROM posts
 27172 | 2026-02-07 12:15:42.50553+00  | active | IO              | DataFileRead | SELECT "posts"."id" FROM "posts" WHERE "posts"."deleted_at" IS NULL AND "posts"."topic_id" = 792311
 25883 | 2026-02-07 12:15:52.665883+00 | active |                 |              | SELECT "posts"."id" FROM "posts" WHERE "posts"."deleted_at" IS NULL AND "posts"."topic_id" = 829626
 20487 | 2026-02-07 12:16:09.733384+00 | active | IO              | DataFileRead | SELECT "posts"."id" FROM "posts" WHERE "posts"."deleted_at" IS NULL AND "posts"."topic_id" = 653216
 42185 | 2026-02-07 12:16:21.053706+00 | active | IO              | DataFileRead | SELECT "posts"."id", "posts"."user_id", "posts"."topic_id", "posts"."post_number", "posts"."raw", "p
 43940 | 2026-02-07 12:16:21.925505+00 | active |                 |              | SELECT pid, query_start, state, wait_event_type, wait_event, left(query, 100) as query FROM pg_stat_
 28126 | 2026-02-07 12:16:21.96218+00  | active | IO              | DataFileRead | SELECT "posts"."id" FROM "posts" WHERE "posts"."deleted_at" IS NULL AND "posts"."topic_id" = 818063
 42323 | 2026-02-07 12:16:21.966689+00 | active | Client          | ClientRead   | SELECT "discourse_post_event_events"."id", "discourse_post_event_events"."status", "discourse_post_e
(11 rows)

My question, basically, boils down to “what could create an UPDATE query that hangs for 9 hours?”.

I’d hypothesize we don’t have enough memory: queries go into swap.
Is having a 40GB posts table a potential issue?


root@discourse_app:/# su - postgres -c "psql discourse -c \"SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size FROM pg_tables WHERE schemaname = 'public' ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10;\""
 schemaname |     tablename     |  size
------------+-------------------+---------
 public     | posts             | 40 GB
 public     | post_search_data  | 4326 MB
 public     | topic_users       | 1306 MB
 public     | topics            | 837 MB
 public     | topic_search_data | 702 MB
 public     | post_replies      | 567 MB
 public     | top_topics        | 512 MB
 public     | user_actions      | 417 MB
 public     | topic_links       | 285 MB
 public     | directory_items   | 243 MB
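
One check I can run to narrow this down: are those UPDATEs actually blocked on locks, or just slow? A minimal sketch using only the standard pg_stat_activity view and pg_blocking_pids(), nothing Discourse-specific:

-- How long has each non-idle backend been running, and is anything blocking it?
SELECT pid,
       now() - query_start   AS runtime,
       state,
       wait_event_type,
       wait_event,
       pg_blocking_pids(pid) AS blocked_by,  -- empty array = not waiting on another backend's lock
       left(query, 80)       AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;

The wait events above (DataFileRead, BufferIO) already hint that the time is going into disk I/O rather than lock waits, but the blocked_by column makes that explicit.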

Have you tried tweaking the postgres memory settings?

If moving to a new VM would fix anything, it’s not a big deal and can be done with zero downtime and just a bit of read-only time.


Yes, we already did once because we had a kinda similar issue - it was clearly Postgres running out of memory.

This time, the symptoms are different - this doesn’t mean we’re excluding the possibility of having a similar issue, though.

Right now many clues point to some thread structures being too heavy for Discourse’s internal tasks. This installation comes from a DIY conversion of a vBB forum, and I must stress again that we had pretty much no issues for two years straight; then last year we had to tweak the Postgres settings, then smooth sailing until ~10 days ago, and now again a few days after the last version upgrade.

The point is that we have many ancient threads that Discourse hasn’t split into 5000-post chunks, and we have thousands of accounts. I’m starting to think that the imported posting history plus normal usage has reached a threshold where our hardware and Discourse’s architecture are struggling to handle normal operations.

So, another question I have right now: I noticed long ago that Discourse is widely used as forum software for huge commercial deployments (e.g. Activision Blizzard). I understand those are paid installations where the Discourse team is paid to offer proper support, and there’s always the option of throwing money at the problem, but I can’t help wondering how big their posts tables are - I doubt they’re smaller than ours. Still, I’d be surprised if a self-hosted installation had issues with the activity level we have (~150 active users, ~2k new posts per day, more or less).

Also, the evening we put the forum in read-only mode, it was lightning fast. Evidently user activity, which includes posting but goes beyond it, is causing the situation we’re in.

Therefore I wonder if there’s any way to exclude part of the data that’s frequently read (topics?) and updated (user stats?) in Discourse, for instance by locking topics, assigning trust level 0 to inactive users, or something like that.
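
To see which tables our normal activity actually hammers, pg_stat_user_tables keeps per-table counters (they accumulate since the last stats reset) - a minimal, plain-PostgreSQL check:

-- Which tables get the most scans and row updates?
SELECT relname,
       seq_scan,        -- sequential (full-table) scans
       idx_scan,        -- index scans
       n_tup_upd,       -- rows updated
       n_tup_hot_upd,   -- updates that avoided index maintenance
       n_dead_tup       -- dead rows awaiting vacuum
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 10;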

This feels like a good line of questioning to me: you might have lots of RAM, but postgres is configured to use it in certain ways, and might be making life hard for itself.

I don’t like that there are two UPDATE queries which look identical - that looks like it could be a scheduled task that took so long a second run got scheduled before the first finished. It will be adding load in a situation where things are already not running well.

But I don’t know how to do those tweaks.
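
If it is a scheduler pile-up, though, a rough way to confirm it would be counting identical statements running at the same time - a sketch against pg_stat_activity only:

-- Are identical statements running concurrently (e.g. the same job started twice)?
SELECT left(query, 60)  AS query_prefix,
       count(*)         AS concurrent,
       min(query_start) AS oldest_start
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1
HAVING count(*) > 1;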


Last time we solved the issue by giving Postgres more resources - it was using too few processes and too little working memory for its workload.

This time around I think we’ve hit the ceiling: either we lower the number of processes and increase the working memory, or the other way around, and see what happens.
We tried the first option a few hours ago and are waiting to see whether it has any beneficial effect.
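
For reference, these are roughly the settings we’re juggling; the values below are only placeholders to show the shape of the trade-off, not recommendations for this server:

# postgresql.conf - illustrative values only; they need to be sized against the
# ~29 GB of RAM on this machine and the real connection count
shared_buffers = 7424MB         # current value; ~25% of RAM is the usual starting point
work_mem = 16MB                 # per sort/hash node *per backend*, so it multiplies with connections
maintenance_work_mem = 512MB    # used by VACUUM and CREATE INDEX
effective_cache_size = 20GB     # planner hint: shared_buffers plus expected OS page cache
max_connections = 100           # fewer backends leave more headroom for work_mem each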


I’ll be interested to see what you find. I’ve been a linux admin but not a database admin. There’s a huge amount to know about both sides.

In the postgres docs I find

and

I note that discourse-setup configures a couple of postgres parameters according to RAM size. Note that if you upscale your server, those parameters won’t be adjusted for the larger RAM.

I feel that a lot of lore and practice might date from the time when 2G was a large server. It’s quite possible that bigger numbers are appropriate for today’s hardware.

Edit: on my (4G) servers I disable huge pages. I believe transparent (automatic) huge pages can cause the kernel to spend time merging and splitting. But I see in the postgres docs that huge pages can be beneficial in some situations.

On my system:

root@ubuntu-4gb-hel1-1:~# egrep Huge /proc/meminfo
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
root@ubuntu-4gb-hel1-1:~# egrep huge /proc/filesystems
nodev	hugetlbfs
root@ubuntu-4gb-hel1-1:~# 
root@ubuntu-4gb-hel1-1:~# head -v /proc/sys/vm/nr_hugepages /proc/sys/vm/nr_overcommit_hugepages
==> /proc/sys/vm/nr_hugepages <==
0

==> /proc/sys/vm/nr_overcommit_hugepages <==
0
root@ubuntu-4gb-hel1-1:~# egrep huge /sys/devices/system/node/node*/meminfo
root@ubuntu-4gb-hel1-1:~# 
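
For completeness, transparent huge pages (as opposed to explicitly reserved ones) are controlled by a different sysfs knob, and postgres has its own setting for explicit huge pages - a quick check, assuming a typical kernel layout:

# THP state - typical values are always / madvise / never
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

# explicit huge pages for postgres are a separate postgresql.conf setting:
#   huge_pages = try    (the default; falls back to normal pages if none are reserved)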

Unless things have changed, running ./discourse-setup should adjust the numbers but detect that it’s an already-running installation. If I recall correctly, it also backs up the app.yml file before making changes.

I don’t know if this has changed recently, as the last time I did it was 3-4 years ago.

As far as I recall, it just changes things like the shared memory (25% of max memory) and the Unicorn web workers, though. I might be forgetting something.

(not swap, but they’re reading from disk)

this fits with my earlier observations:

Note this is the size of the relation and indexes. Compare with pg_relation_size.
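
A hedged sketch of that comparison for the biggest table here, using the standard size functions, so the 40 GB splits into heap vs indexes vs everything including TOAST:

-- heap only vs. indexes vs. total (indexes + TOAST) for the posts table
SELECT pg_size_pretty(pg_relation_size('posts'))       AS heap_only,
       pg_size_pretty(pg_indexes_size('posts'))        AS indexes,
       pg_size_pretty(pg_total_relation_size('posts')) AS total;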

This is from ScoreCalculator, part of PeriodicalUpdates.

This is your finding that needs to be solved. By comparison, here on meta Jobs::EnsureDbConsistency takes <2min and Jobs::TopRefreshOlder takes <10s:

Postgres needs more memory. Give it as much as you can.

You might also see benefit from a VACUUM ANALYZE or VACUUM ANALYZE FULL. Doing the first never hurts.

I’d probably do, in order:

  • vacuum analyze
  • pause sidekiq then vacuum analyze full (this freezes the tables to rewrite them fully, may incur some failures while it runs - see the sketch below)
  • more memory to postgres
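
A minimal sketch of that sequence in psql, assuming sidekiq is already paused for the FULL step (VACUUM FULL rewrites the table under an ACCESS EXCLUSIVE lock, which is where the possible failures come from); targeting posts, the biggest table, is just an example:

-- step 1: safe at any time, only takes light locks
VACUUM (VERBOSE, ANALYZE);

-- step 2: with sidekiq paused - this rewrites the table and blocks queries
-- touching it until it finishes
VACUUM (FULL, VERBOSE, ANALYZE) posts;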

I’ve had a couple of restores/upgrades fail during migration due to space, in spite of there being multiple GB of space remaining (like Restore fails due to disk space on migration).

Might doing a vacuum before backup help out with that issue?

Also, a small nitpick on the spelling - that’s

 discourse=# VACUUM FULL ANALYZE;