I’m mainly just stumped as to how performance can bounce between completing ~11M and ~300K jobs in a day within ~1 week, on the same configuration. That’s roughly a 35x difference in jobs per second.
For the CPU usage, it’s back down to ~15-20%, which is about the usual, but jobs are still processing at the same (slow) speed.
Just to clarify/confirm: I meant assigning (not adding) some Sidekiq processes to exclusively work the low priority queue, since it appears the low priority tasks can be processed at a much faster rate and possibly don’t suffer the same bottlenecks. I was speculating this might explain how the jobs-per-second rate can vary so drastically (i.e. low priority ‘easy’ tasks stuck behind the default queue backlog).
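For reference, standard Sidekiq can pin a process to specific queues from the command line; whether this maps cleanly onto how Discourse spawns its own Sidekiq workers is an assumption on my part:

```bash
# Standard Sidekiq CLI (hypothetical for a Discourse install):
bundle exec sidekiq -q low                 # only pull jobs from the low queue
bundle exec sidekiq -q low,3 -q default,1  # prefer low over default, 3:1 weighting
```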
To clarify: do you think PostgreSQL performance is causing the slow job completion, or just the high-CPU event I noticed yesterday (which is now back to normal)?
Update: I tried deleting the low priority and default queues a few times with no impact on speed; the default queue just grows back immediately. I then tried deleting the default queue and enabling read-only mode. That made the jobs-per-second rate spike dramatically, blazing through the low priority queue at roughly 100x the previous speed.
Edit: It seems that even with just a large low priority queue, processing is still slow. If I set Discourse to read-only and then empty both the low priority and default queues, job processing afterwards stays super fast, emptying the scheduled tasks and queues until I disable read-only mode.
My next step is to figure out exactly which process is causing the trouble by going into the Discourse app and running htop or top to see what’s using the most CPU.
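Concretely, that looks something like this (assuming the standard /var/discourse install; htop may need installing inside the container first, while top is built in):

```bash
cd /var/discourse
./launcher enter app    # opens a shell inside the running container
top                     # or: apt-get update && apt-get install -y htop && htop
```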
Thanks for your input @pfaffman. I think db_shared_buffers and db_work_mem in the app.yml are the only controls over PostgreSQL’s RAM usage, right?
I’ve tinkered a bit, both upwards and downwards. Current settings in the app.yml are:
```yaml
db_shared_buffers: "32768MB"
db_work_mem: "128MB"
```
Total system RAM is 128 GB.
I’ve also tried changing max_connections in /var/discourse/shared/standalone/postgres_data/postgresql.conf and then rebuilding Discourse. I tried values above the default (100), from 200 to 500; it’s currently set at 300. I’m not sure modifying it there actually changes the effective max_connections, though.
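One way to verify the effective value is to ask the running server directly (standard psql, run as the postgres user inside the container):

```bash
cd /var/discourse && ./launcher enter app
su postgres -c 'psql -c "SHOW max_connections;"'
```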
I see these in /var/discourse/templates/postgres.template.yml
Thanks @bartv, following your suggestion I’ve been watching from inside the Discourse app via top. I’m seeing quite a lot of postmaster processes run by the postgres user, with CPU usage varying. The screenshots represent extended periods of time with similar usage stats.
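To see what those postmaster backends are actually doing, pg_stat_activity is the usual first stop (standard PostgreSQL; the wait_event columns assume version 9.6 or newer):

```sql
-- one row per backend; non-idle ones show the query they are running
SELECT pid, state, wait_event_type, wait_event, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle';
```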
You should read up on postgres tuning. That level of performance is a bit beyond the typical self hosting that’s seen here.
I would try giving Postgres about 3/4 of the RAM. I’d certainly split into separate data and web containers, but you might need a more complex Postgres setup to get the performance you need.
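If you go the two-container route, the discourse_docker repo ships sample definitions; the usual procedure is roughly this (back up first, and read the comments in each sample file):

```bash
cd /var/discourse
cp samples/data.yml containers/data.yml
cp samples/web_only.yml containers/web_only.yml
# edit both files (DB password, host address, mail settings), then:
./launcher rebuild data
./launcher rebuild web_only
```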
EDIT: But I don’t have experience with much larger databases, so see below!
```text
ERROR: pg_stat_statements must be loaded via shared_preload_libraries
```
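For context, that error appears when pg_stat_statements is queried without the module being preloaded. The standard PostgreSQL setup (generic Postgres, not Discourse-specific) is:

```sql
-- 1. In postgresql.conf (needs a PostgreSQL restart to take effect):
--      shared_preload_libraries = 'pg_stat_statements'
-- 2. Then, once per database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- 3. After which this should work:
SELECT count(*) FROM pg_stat_statements;
```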
My guess is that my edits to the postgresql.conf file (/var/discourse/shared/standalone/postgres_data/postgresql.conf) aren’t taking effect (I rebuilt Discourse after editing). Is it possible to make these edits via the app.yml file? Or do you see anything I’ve done wrong?
Rebuilding Discourse will erase those changes. Restarting the container will probably do the trick (something like sv restart postgres inside the container might also do it).
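Concretely:

```bash
# from the host: restart the whole container
cd /var/discourse
./launcher restart app

# or restart just PostgreSQL inside the running container:
./launcher enter app
sv restart postgres
```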
Thanks, this helped a lot, at first. The issue seemed solved and the jobs queue was moving really fast after just increasing max_connections in the postgresql.conf. Unfortunately it slowed down again after about a day.
Putting the steps below, in case they’re useful to others wanting to increase the PostgreSQL max_connections.
```bash
docker ps
```
Get the container ID (e.g. aaabbbccc123) and substitute it in the commands below.
Copy the postgresql.conf file from inside the Docker container to the local file system:
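Something like the following (the in-container path is my assumption from the volume mapping: /var/discourse/shared/standalone on the host is mounted at /shared inside the container, so editing the host-side file directly should be equivalent):

```bash
docker cp aaabbbccc123:/shared/postgres_data/postgresql.conf .
# edit postgresql.conf locally, then copy it back and restart postgres:
docker cp postgresql.conf aaabbbccc123:/shared/postgres_data/postgresql.conf
```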
I can’t really read the result, though; I think I’ve done something wrong:
```text
      total      |       avg        | query
-----------------+------------------+----------------------------------------------
 1671.1110420745 | 374.736186677194 | SELECT COUNT(*) FROM (                       +
                 |                  |     SELECT $1 FROM                           +
                 |                  |     notifications n                          +
                 |                  |     LEFT JOIN topics t ON t.id = n.topic_id  +
                 |                  |     WHERE t.deleted_at IS NULL AND           +
                 |                  |         n.notification_type <> $2 AND        +
                 |                  |         n.user_id = $3 AND                   +
                 |                  |         n.id > $4 AND
```
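One readability note: psql wraps long query text with those + continuation markers; expanded display helps. A sketch (the pg_stat_statements column names vary by PostgreSQL version: total_time/mean_time before v13, total_exec_time/mean_exec_time from v13 on):

```sql
-- inside psql: let psql choose expanded output for wide rows
\x auto
SELECT total_time AS total, mean_time AS avg, query
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```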
Now that you know what you want to do, you can make these changes with a replace stanza in your app.yml. For example:
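(A sketch only; the PostgreSQL version in the path and the from pattern must match what your templates/postgres.template.yml actually contains.)

```yaml
run:
  - replace:
      filename: "/etc/postgresql/13/main/postgresql.conf" # version segment varies
      from: /^#?max_connections *=.*/
      to: "max_connections = 300"
```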
Also, you could just `./launcher enter app` and edit the files directly. Note, though, that when you rebuild, those changes will not be in the new container.