Very slow Sidekiq issue with large queue due to massive numbers of unread user notifications

Thanks @Falco

I’m mainly just stumped as to how performance can bounce between completing ~11M and ~300k jobs in a day within the space of a week, on the same configuration: a ~35x difference in jobs per second.

As for CPU usage, it’s back down to ~15-20%, which is about the usual, yet jobs are still processing at the same (slow) speed.

Just to clarify/confirm: I meant assigning (not adding) some Sidekiq processes to exclusively handle the low priority queue, as it appears the low priority tasks can be processed at a much faster rate and possibly don’t suffer the same bottlenecks. I was speculating this might explain how the jobs per second can vary so drastically (i.e. ‘easy’ low priority tasks stuck behind the default queue backlog).

To clarify - do you think PostgreSQL performance is causing the slow job completion, or just the high CPU usage event I noticed yesterday (which is now back to normal)?

1 Like

This is all on SSD, right?

2 Likes

Yes correct @Stephen - NVMe SSDs in RAID 1.

1 Like

Update: I tried deleting the low priority and default queues a few times, with no impact on speed, as the default queue just grows back immediately. I then tried deleting the default queue and enabling read-only mode. This made the jobs per second spike dramatically, blazing through the low priority queue (~100x the jobs-per-second rate).

Edit: It seems that even with just a large low priority queue, the processing speed is still slow. If I set Discourse to read-only and then empty both the low and default priority queues, job processing afterwards stays super fast, emptying the scheduled tasks and queues, until I disable read-only mode. :yuno:
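
For anyone wanting to reproduce the queue deletion from a console rather than the Sidekiq web UI, this is Sidekiq’s standard Queue API (a sketch; the queue names ‘default’ and ‘low’ are assumptions that may differ on your install):

cd /var/discourse
./launcher enter app
rails c
# In the rails console - upstream Sidekiq API, not Discourse-specific:
require 'sidekiq/api'
Sidekiq::Queue.new('default').clear
Sidekiq::Queue.new('low').clear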

My next step would be figuring out exactly which process is causing the trouble by going into the Discourse app and running htop or top to see what is using the most CPU.

3 Likes

It does sound like postgres is the bottleneck. You might configure prometheus to track its performance and verify that it has access to enough RAM.
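
If you want a quick signal before wiring up prometheus, a rough cache-hit check against the standard pg_stat_database view can hint at whether postgres has enough RAM for its working set (a sketch; the common rule of thumb is that a ratio well below ~0.99 suggests shared_buffers is too small):

-- Fraction of block reads served from shared_buffers rather than disk:
SELECT sum(blks_hit)::float / nullif(sum(blks_hit) + sum(blks_read), 0)
       AS cache_hit_ratio
FROM pg_stat_database;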

2 Likes

Thanks for your input @pfaffman :slight_smile: I think db_shared_buffers and db_work_mem in the app.yml are the only controls for PostgreSQL RAM access, right?

I’ve tinkered a bit, both upwards and downwards. Current settings in the app.yml are:
db_shared_buffers: "32768MB"
db_work_mem: "128MB"

With total system RAM of 128 GB.

I’ve also tried changing max_connections in /var/discourse/shared/standalone/postgres_data/postgresql.conf and then rebuilding Discourse. I tried values above the default (100), from 200 to 500; it’s currently set at 300. I’m not sure whether modifying it there actually changes the max connections value, though.
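
One way to confirm whether the edit actually took effect is to ask the running server directly, since SHOW reports the live value:

./launcher enter app
sudo -u postgres psql -d discourse -c "SHOW max_connections;"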

I see these in /var/discourse/templates/postgres.template.yml:

db_synchronous_commit: "off"
db_shared_buffers: "256MB"
db_work_mem: "10MB"
db_default_text_search_config: "pg_catalog.english"
db_name: discourse
db_user: discourse
db_wal_level: minimal
db_max_wal_senders: 0
db_checkpoint_segments: 6
db_logging_collector: off
db_log_min_duration_statement: 100

1 Like

Thanks @bartv, following your suggestion I’ve been watching from inside the Discourse app via top. I’m seeing quite a lot of postmaster processes run by the postgres user, with varying CPU usage. The screenshots each represent an extended period of time with similar usage stats.

Using ~95% of 32 cores:

Using ~20%: lower CPU usage from postmaster.

Using ~6% CPU, while read-only mode was active.
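
To tie a busy postmaster PID from top back to the query it is running, something like this from the postgres console can help (pg_stat_activity is a standard view; pid matches the PID column in top):

-- Non-idle backends, longest-running first:
SELECT pid, state, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC NULLS LAST;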

1 Like

How big is your database? How many users do you have? How many new posts per day?

2 Likes

First thing you should do is run VACUUM ANALYZE; from the postgres console.

This might take a while to run; you might want to stop sidekiq temporarily to lighten the load while it works.

If that doesn’t help, we should enable pg_stat_statements and then check to see what queries are taking a huge amount of CPU.

4 Likes

@pfaffman

  • /var/discourse/shared/standalone/postgres_data folder is 170GB
  • 61.7k active users in the last 30 days (not sure on the absolute total)
  • ~50k to 80k new posts per day
2 Likes

Oh. That’s non-trivial.

You should read up on postgres tuning. That level of performance is a bit beyond the typical self-hosting seen here.

I would try giving postgres about 3/4 of the RAM. I’d certainly split into separate data and web containers. But you might need a more complex postgres setup to get the performance you need.
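
The usual starting point for that split is the sample container definitions that ship with discourse_docker (a sketch; both files need your hostnames, secrets, and memory settings filled in before rebuilding):

cd /var/discourse
cp samples/data.yml containers/data.yml          # postgres + redis container
cp samples/web_only.yml containers/web_only.yml  # web/app container
# edit both files, then:
./launcher rebuild data
./launcher rebuild web_only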

EDIT: But I don’t have experience with much larger databases, so see below! :wink:

5 Likes

We do much larger databases with much less RAM and without nearly as much CPU usage.

The pg_stat_statements information will likely be able to tell us what’s wrong.

8 Likes

Thanks a lot for the help guys.

So I tried running VACUUM ANALYZE; - no luck, unfortunately. Commands used are below for reference:

cd /var/discourse/
./launcher enter app
sudo -u postgres psql
\c discourse
VACUUM ANALYZE;

I attempted to enable pg_stat_statements; steps below:

Added/modified the lines below in /var/discourse/shared/standalone/postgres_data/postgresql.conf:

shared_preload_libraries = 'pg_stat_statements' # (change requires restart)
pg_stat_statements.max = 10000
pg_stat_statements.track = all
pg_stat_statements.track_utility = on
pg_stat_statements.save = off

Then rebuilt Discourse and ran:

./launcher enter app
sudo -u postgres psql
\c discourse
CREATE EXTENSION pg_stat_statements;

I’ve tried to do queries, but I get this error:

ERROR: pg_stat_statements must be loaded via shared_preload_libraries

My guess is that my edits to the postgresql.conf file (/var/discourse/shared/standalone/postgres_data/postgresql.conf) are not taking effect (I rebuilt Discourse after editing). Is it possible to make these edits via the app.yml file? Or do you see anything I’ve done wrong?

1 Like

Rebuilding Discourse will erase those changes. Restarting the container will probably do the trick (something like sv restart postgres inside the container might also do it).

2 Likes

Thanks, I tried restarting the container:

./launcher stop app
./launcher start app

Still get the same error when attempting to query:

ERROR: pg_stat_statements must be loaded via shared_preload_libraries

The changes I made previously are still persisting in the file here even through a rebuild:

/var/discourse/shared/standalone/postgres_data/postgresql.conf

I’m suspecting that this file isn’t where I should be making these edits :face_with_monocle:

Maybe it’s because I’m only using app.yml (and mail-receiver.yml) in the containers folder; I have not set up a separate data.yml.

Without actually looking, it’s likely /etc/postgres inside the container. You might also need to install whatever library it requires.

2 Likes

Thanks, this helped a lot - at first :man_cartwheeling:. The issue seemed solved, and the job queue was running really fast after just increasing max_connections in postgresql.conf. Unfortunately, it slowed down again after about a day.

Putting the steps below in case they’re useful to others wanting to increase PostgreSQL’s max_connections.

docker ps

Get the container ID, e.g. aaabbbccc123, and substitute it into the commands below.

Copy the postgresql.conf file from inside the docker container to the local file system:

docker cp aaabbbccc123:/etc/postgresql/10/main/postgresql.conf /srv

Edit the configuration:

nano /srv/postgresql.conf

Copy it back into the docker container:

docker cp /srv/postgresql.conf aaabbbccc123:/etc/postgresql/10/main/postgresql.conf

cd /var/discourse
./launcher stop app
./launcher start app

Delete the leftover file (optional):

rm /srv/postgresql.conf

1 Like

@supermathie I believe I’ve successfully enabled pg_stat_statements :grinning:

I tried using this query:

SELECT
(total_time / 1000 / 60) as total,
(total_time/calls) as avg,
query
FROM pg_stat_statements
ORDER BY 1 DESC
LIMIT 100;

From this guide: The most useful Postgres extension: pg_stat_statements

I can’t really read the result, though; I think I’ve done something wrong.

      total      |       avg        |                    query
-----------------+------------------+----------------------------------------------
 1671.1110420745 | 374.736186677194 | SELECT COUNT(*) FROM (                      +
                 |                  |     SELECT $1 FROM                          +
                 |                  |     notifications n                         +
                 |                  |     LEFT JOIN topics t ON t.id = n.topic_id +
                 |                  |     WHERE t.deleted_at IS NULL AND          +
                 |                  |     n.notification_type <> $2 AND           +
                 |                  |     n.user_id = $3 AND                      +
                 |                  |     n.id > $4 AND
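
A variant that rounds the times and collapses each query onto one line can make the output easier to scan (a sketch, assuming PostgreSQL 10, where pg_stat_statements exposes timings in a total_time column, in milliseconds):

SELECT
  round((total_time / 1000 / 60)::numeric, 1) AS total_minutes,
  round((total_time / calls)::numeric, 1)     AS avg_ms,
  calls,
  left(regexp_replace(query, '\s+', ' ', 'g'), 80) AS query
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 20;

Even in the raw form, though, the output above is telling: the top consumer is a COUNT(*) over notifications joined to topics, which lines up with the topic title about massive numbers of unread user notifications.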

Now that you know what you want to do, you can make these changes with a replace stanza in your app.yml.
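
For example, something along these lines (a sketch; the file path and regexes are assumptions you would verify against your own container):

run:
  - replace:
      filename: "/etc/postgresql/10/main/postgresql.conf"
      from: /^#?max_connections *=.*/
      to: "max_connections = 300"
  - replace:
      filename: "/etc/postgresql/10/main/postgresql.conf"
      from: /^#?shared_preload_libraries *=.*/
      to: "shared_preload_libraries = 'pg_stat_statements'"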

Also, you could just ./launcher enter app and edit the files directly. Note, though, that when you rebuild, those changes will not be in the new container.

2 Likes