Very slow Sidekiq issue with large queue due to massive numbers of unread user notifications

Thanks @Falco

I’m mainly just stumped as to how performance can bounce between completing ~11M and ~300k jobs in a day within the space of a week, on the same configuration: a ~35x difference in jobs per second.

As for CPU usage, it’s back down to ~15-20%, which is about the usual, yet jobs are still processing at the same (slow) speed.

Just to clarify/confirm: I meant assigning (not adding) some Sidekiq processes to exclusively handle the low priority queue, as it appears the low priority tasks can be processed at a much faster rate and possibly don’t suffer the same bottlenecks. I was speculating this might explain how the jobs per second can vary so drastically (i.e. ‘easy’ low priority tasks stuck behind the default queue backlog).

To clarify - do you think PostgreSQL performance is causing the slow job completion, or just the high CPU usage event I noticed yesterday (which is now back to normal)?

1 Like

This is all on SSD, right?

2 Likes

Yes correct @Stephen - NVMe SSDs in RAID 1.

1 Like

Update: I tried deleting the low priority and default queues a few times, with no impact on speed, as the default queue just grows back immediately. I then tried deleting the default queue and enabling read-only mode. This made the jobs per second spike dramatically, blazing through the low priority queue (~100x the jobs-per-second rate).

Edit: It seems that even with just a large low priority queue, the processing speed is still slow. If I set Discourse to read-only and then empty both the low and default priority queues, job processing afterwards stays super fast, emptying the scheduled tasks and queues, until I disable read-only mode. :yuno:
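
For anyone wanting to reproduce the queue deletion from a console rather than the Sidekiq web UI, this is Sidekiq’s standard Queue API (a sketch; the queue names ‘default’ and ‘low’ are assumptions that may differ on your install):

cd /var/discourse
./launcher enter app
rails c
# In the rails console - upstream Sidekiq API, not Discourse-specific:
require 'sidekiq/api'
Sidekiq::Queue.new('default').clear
Sidekiq::Queue.new('low').clear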

My next step would be figuring out exactly which process is causing the trouble by going into the Discourse app and running htop or top to see what is using the most CPU.

3 Likes

It does sound like postgres is the bottleneck. You might configure prometheus to track its performance and verify that it has access to enough RAM.
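
If you want a quick signal before wiring up prometheus, a rough cache-hit check against the standard pg_stat_database view can hint at whether postgres has enough RAM for its working set (a sketch; the common rule of thumb is that a ratio well below ~0.99 suggests shared_buffers is too small):

-- Fraction of block reads served from shared_buffers rather than disk:
SELECT sum(blks_hit)::float / nullif(sum(blks_hit) + sum(blks_read), 0)
       AS cache_hit_ratio
FROM pg_stat_database;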

2 Likes

Thanks for your input @pfaffman :slight_smile: I think db_shared_buffers and db_work_mem in the app.yml are the only controls for PostgreSQL RAM access, right?

I’ve tinkered a bit, both upwards and downwards. Current settings in the app.yml are:
db_shared_buffers: "32768MB"
db_work_mem: "128MB"

With total system RAM of 128 GB.

I’ve also tried changing max_connections in /var/discourse/shared/standalone/postgres_data/postgresql.conf and then rebuilding Discourse. I tried values above the default (100), from 200 to 500; it’s currently set at 300. I’m not sure whether modifying it there actually changes the max connections value, though.
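
One way to confirm whether the edit actually took effect is to ask the running server directly, since SHOW reports the live value:

./launcher enter app
sudo -u postgres psql -d discourse -c "SHOW max_connections;"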

I see these in /var/discourse/templates/postgres.template.yml:

db_synchronous_commit: "off"
db_shared_buffers: "256MB"
db_work_mem: "10MB"
db_default_text_search_config: "pg_catalog.english"
db_name: discourse
db_user: discourse
db_wal_level: minimal
db_max_wal_senders: 0
db_checkpoint_segments: 6
db_logging_collector: off
db_log_min_duration_statement: 100

1 Like

Thanks @bartv, following your suggestion I’ve been watching from inside the Discourse app via top. I’m seeing quite a lot of postmaster processes run by the postgres user, with varying CPU usage. The screenshots each represent an extended period of time with similar usage stats.

Using ~95% of 32 cores:

Using ~20%: lower CPU usage from postmaster.

Using ~6% CPU, while read-only mode was active.
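
To tie a busy postmaster PID from top back to the query it is running, something like this from the postgres console can help (pg_stat_activity is a standard view; pid matches the PID column in top):

-- Non-idle backends, longest-running first:
SELECT pid, state, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC NULLS LAST;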

1 Like

How big is your database? How many users do you have? How many new posts per day?

2 Likes

First thing you should do is run VACUUM ANALYZE; from the postgres console.

This might take a while to run; you might want to stop sidekiq temporarily to lighten the load while it works.

If that doesn’t help, we should enable pg_stat_statements and then check to see what queries are taking a huge amount of CPU.

4 Likes

@pfaffman

  • /var/discourse/shared/standalone/postgres_data folder is 170GB
  • 61.7k active users in the last 30 days (not sure on the absolute total)
  • ~50k to 80k new posts per day
2 Likes

Oh. That’s non-trivial.

You should read up on postgres tuning. That level of performance is a bit beyond the typical self-hosting seen here.

I would try giving postgres about 3/4 of the RAM. I’d certainly split into separate data and web containers. But you might need a more complex postgres setup to get the performance you need.
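
The usual starting point for that split is the sample container definitions that ship with discourse_docker (a sketch; both files need your hostnames, secrets, and memory settings filled in before rebuilding):

cd /var/discourse
cp samples/data.yml containers/data.yml          # postgres + redis container
cp samples/web_only.yml containers/web_only.yml  # web/app container
# edit both files, then:
./launcher rebuild data
./launcher rebuild web_only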

EDIT: But I don’t have experience with much larger databases, so see below! :wink:

5 Likes

We do much larger databases with much less RAM and without nearly as much CPU usage.

The pg_stat_statements information will likely be able to tell us what’s wrong.

8 Likes

Thanks a lot for the help guys.

So I tried running VACUUM ANALYZE; - no luck, unfortunately. Commands used are below for reference:

cd /var/discourse/
./launcher enter app
sudo -u postgres psql
\c discourse
VACUUM ANALYZE;

I attempted to enable pg_stat_statements; steps below:

Added/modified the lines below in /var/discourse/shared/standalone/postgres_data/postgresql.conf:

shared_preload_libraries = 'pg_stat_statements' # (change requires restart)
pg_stat_statements.max = 10000
pg_stat_statements.track = all
pg_stat_statements.track_utility = on
pg_stat_statements.save = off

Then rebuilt Discourse and ran:

./launcher enter app
sudo -u postgres psql
\c discourse
CREATE EXTENSION pg_stat_statements;

I’ve tried to do queries, but I get this error:

ERROR: pg_stat_statements must be loaded via shared_preload_libraries

My guess is that my edits to the postgresql.conf file (/var/discourse/shared/standalone/postgres_data/postgresql.conf) are not taking effect (I rebuilt Discourse after editing). Is it possible to make these edits via the app.yml file? Or do you see anything I’ve done wrong?

1 Like

Rebuilding Discourse will erase those changes. Restarting the container will probably do the trick (something like sv restart postgres inside the container might also do it).

2 Likes

Thanks, I tried restarting the container:

./launcher stop app
./launcher start app

Still get the same error when attempting to query:

ERROR: pg_stat_statements must be loaded via shared_preload_libraries

The changes I made previously are still persisting in the file here even through a rebuild:

/var/discourse/shared/standalone/postgres_data/postgresql.conf

I’m suspecting that this file isn’t where I should be making these edits :face_with_monocle:

Maybe it’s because I’m only using app.yml (and mail-receiver.yml) in the containers folder; I have not set up a separate data.yml.

Without actually looking, it’s likely /etc/postgres inside the container. You might also need to install whatever library it requires.

2 Likes

Thanks, this helped a lot - at first :man_cartwheeling:. The issue seemed solved, and the job queue was running really fast after just increasing max_connections in postgresql.conf. Unfortunately, it slowed down again after about a day.

Putting the steps below in case they’re useful to others wanting to increase PostgreSQL’s max_connections.

docker ps

Get the container ID, e.g. aaabbbccc123, and substitute it into the commands below.

Copy the postgresql.conf file from inside the docker container to the local file system:

docker cp aaabbbccc123:/etc/postgresql/10/main/postgresql.conf /srv

Edit the configuration:

nano /srv/postgresql.conf

Copy it back into the docker container:

docker cp /srv/postgresql.conf aaabbbccc123:/etc/postgresql/10/main/postgresql.conf

cd /var/discourse
./launcher stop app
./launcher start app

Delete the leftover file (optional):

rm /srv/postgresql.conf

1 Like

@supermathie I believe I’ve successfully enabled pg_stat_statements :grinning:

I tried using this query:

SELECT
(total_time / 1000 / 60) as total,
(total_time/calls) as avg,
query
FROM pg_stat_statements
ORDER BY 1 DESC
LIMIT 100;

From this guide: The most useful Postgres extension: pg_stat_statements

I can’t really read the result, though; I think I’ve done something wrong.

      total      |       avg        |                    query
-----------------+------------------+----------------------------------------------
 1671.1110420745 | 374.736186677194 | SELECT COUNT(*) FROM (                      +
                 |                  |     SELECT $1 FROM                          +
                 |                  |     notifications n                         +
                 |                  |     LEFT JOIN topics t ON t.id = n.topic_id +
                 |                  |     WHERE t.deleted_at IS NULL AND          +
                 |                  |     n.notification_type <> $2 AND           +
                 |                  |     n.user_id = $3 AND                      +
                 |                  |     n.id > $4 AND
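
A variant that rounds the times and collapses each query onto one line can make the output easier to scan (a sketch, assuming PostgreSQL 10, where pg_stat_statements exposes timings in a total_time column, in milliseconds):

SELECT
  round((total_time / 1000 / 60)::numeric, 1) AS total_minutes,
  round((total_time / calls)::numeric, 1)     AS avg_ms,
  calls,
  left(regexp_replace(query, '\s+', ' ', 'g'), 80) AS query
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 20;

Even in the raw form, though, the output above is telling: the top consumer is a COUNT(*) over notifications joined to topics, which lines up with the topic title about massive numbers of unread user notifications.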

Now that you know what you want to do, you can make these changes with a replace stanza in your app.yml.
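
For example, something along these lines (a sketch; the file path and regexes are assumptions you would verify against your own container):

run:
  - replace:
      filename: "/etc/postgresql/10/main/postgresql.conf"
      from: /^#?max_connections *=.*/
      to: "max_connections = 300"
  - replace:
      filename: "/etc/postgresql/10/main/postgresql.conf"
      from: /^#?shared_preload_libraries *=.*/
      to: "shared_preload_libraries = 'pg_stat_statements'"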

Also, you could just ./launcher enter app and edit the files directly. Note, though, that when you rebuild, those changes will not be in the new container.

2 Likes