Very slow Sidekiq issue with large queue due to massive numbers of unread user notifications

So I cleared out the low priority queue one more time and left it for a couple of days (no changes since my last update). It didn't speed up immediately and jobs kept piling up rapidly, but it seems to have sorted itself out given some time. Job processing is blazing fast now. :slight_smile: Using a 20s polling interval, I'm seeing between 55 and 140 jobs per second over the last few minutes. The per-day numbers look healthy too, with no queue build-up.

Thanks a lot for the help @Falco @supermathie @Stephen, I really appreciated it!

Regarding your questions, I’m not so sure how to check those. I’d be happy to check (would need some guidance) and provide the info if it’s still helpful though. Something possibly relevant is that I’ve had the ‘max emails per day per user’ setting set to 3 for a long time.

4 Likes

I may have spoken too soon. Sidekiq jobs are currently running at ~1 to 3 per second with a queue of 8.81m jobs.

:philosoraptor:

1 Like

When did you last update? I added some performance improvements to the PostAlert job a few days ago:

Some of our very large sites were seeing performance issues for categories with lots of people “watching first post”. This commit has resolved the issue on our hosting, so there’s a chance it could help your site as well.

6 Likes

Great! I'm updating now; I last updated ~10 days ago (tests-passed). I'll monitor to see if there's an improvement and report back. Thanks!

4 Likes

Update: No immediate improvements to speed since updating unfortunately. Will see if it improves with some time.

4 Likes

Update: Still running slow and the queue is building up. I see a lot of postmaster processes via 'top': ~85% total CPU usage (32 cores), the vast majority of it from postmaster. Which is interesting, as earlier today CPU usage was 20-35% (Sidekiq was still moving slowly at that time too). Related: Postmaster eating all CPU

1 Like

Think these Redis warnings could have something to do with it? They're displayed during the app rebuild:

# WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.

# WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.

Has anyone fixed these warnings for a Docker install?

I already added vm.overcommit_memory = 1 to /etc/sysctl.conf to fix the overcommit memory warning.
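
For anyone else seeing the overcommit warning, this is roughly what that looks like on the Docker host (not inside the container); sysctl -p reloads the file without a reboot:

# on the host, as root
echo 'vm.overcommit_memory = 1' >> /etc/sysctl.conf
sysctl -p                       # reload /etc/sysctl.conf
sysctl vm.overcommit_memory     # should now report: vm.overcommit_memory = 1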

1 Like

So I fixed the Transparent Huge Pages (THP) warning by running 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root. I haven't added it to rc.local for persistence yet - just testing for now. Did a Discourse rebuild; performance is about the same, maybe a slight improvement.
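
For completeness, the persistence step would look roughly like this (assuming the host actually uses /etc/rc.local, as the Redis warning suggests; on systemd-only distros a small unit file is needed instead):

# apply immediately (as root, on the host)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# persist: add that same line to /etc/rc.local, above any final 'exit 0',
# and make sure the file is executable
chmod +x /etc/rc.local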

Not so sure how to fix this warning though:
# WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.

I'm seeing some people saying that Docker will still use the value 128 even if the host value is set higher, e.g. via a guide like this: Performance tips for Redis Cache Server – Tech and Me
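
From what I can tell, net.core.somaxconn is namespaced per container, so it has to be raised on the host and then set for the container as well; the docker run flag below is standard, but where to hook it into the Discourse launcher is the part I haven't confirmed:

# on the host
sysctl -w net.core.somaxconn=1024
echo 'net.core.somaxconn = 1024' >> /etc/sysctl.conf   # persist across reboots

# for a plain docker run, the per-container value can be passed like this:
docker run --sysctl net.core.somaxconn=1024 ...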

1 Like

I’m thinking it could be a good idea to assign some UNICORN_SIDEKIQS specifically to the low priority queue.

It seems like the 'default' priority tasks (e.g. PostAlert) are moving quite slowly, and once there is a backlog of them, the low priority queue (whose tasks could be completed at a significantly faster rate) balloons because almost none of them get processed. I suspect this ballooning slows down overall processing across all queues, and it could also explain the large fluctuation in jobs per second.

Does anyone know if it's possible to assign UNICORN_SIDEKIQS in the app.yml file (or some other way) to specific priority queues?
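
To illustrate what I mean, stock Sidekiq lets a worker process be pinned to specific queues from the command line; I don't know whether Discourse exposes this via UNICORN_SIDEKIQS or its unicorn config, so treat this purely as a generic sketch:

# generic Sidekiq CLI examples (not necessarily how Discourse wires its workers)
bundle exec sidekiq -q low        # this worker only pulls from the 'low' queue
bundle exec sidekiq -q default    # while this one handles only 'default'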

Adding more Sidekiqs while your database is a bottleneck will only make it worse.

Like I said above, you need to debug the underlying PostgreSQL performance problem.

10 Likes

Thanks @Falco

I'm mainly just stumped as to how performance can bounce between completing ~11m and ~300k jobs in a day within about a week on the same configuration - a speed difference of roughly 35x in jobs per second.

As for CPU usage, it's back down to ~15-20%, which is about usual. Jobs are still processing at the same (slow) speed.

Just to clarify/confirm, I meant assigning (not adding) some Sidekiqs to exclusively process the low priority queue, as it looked like the low priority tasks can be processed at a much faster rate and possibly don't suffer the same bottlenecks. I was speculating this might explain how the jobs per second can vary so drastically (i.e. low priority 'easy' tasks stuck behind the default queue backlog).

To clarify - do you think that PostgreSQL performance is causing the slow job completion, or just the high CPU usage event I noticed yesterday (which is now back to normal)?

1 Like

This is all on SSD, right?

2 Likes

Yes, correct @Stephen - NVMe SSDs in RAID 1.

1 Like

Update: I tried deleting the low priority and default queues a few times with no impact on speed, as the default queue just grows again immediately. I then tried deleting the default queue and enabling read-only mode. This made the jobs per second spike dramatically, blazing through the low priority queue (roughly 100x the jobs per second).

Edit: It seems that even with just a large low priority queue, processing is still slow. If I set Discourse to read-only and then empty both the low and default priority queues, job processing afterwards stays super fast, churning through the scheduled tasks and queues until I disable read-only mode. :yuno:
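
For reference, a rough sketch of how those steps can be done from a console inside the app container (the queue names and the read-only helpers are assumptions based on stock Discourse/Sidekiq, so verify before relying on them):

cd /var/discourse && ./launcher enter app
rails c                                   # Rails console inside the container
# then, in the console:
Sidekiq::Queue.new('default').clear       # drop all queued jobs in the default queue
Sidekiq::Queue.new('low').clear           # and in the low priority queue
Discourse.enable_readonly_mode            # put the site into read-only
Discourse.disable_readonly_mode           # ...and back out when done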

My next step would be figuring out exactly which process is causing the trouble by going into the Discourse app and running htop or top to see what's using the most CPU.
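
If it helps, getting a shell in the app container for that looks roughly like this (htop may need to be installed first):

cd /var/discourse
./launcher enter app                        # shell inside the Discourse container
top                                         # stock top works out of the box
apt-get update && apt-get install -y htop   # optional: nicer per-core view
htop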

3 Likes

It does sound like Postgres is the bottleneck. You might configure Prometheus to track its performance and check that it's getting access to enough RAM.
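
As a rough sketch (not Discourse-specific, and note the standalone container doesn't expose PostgreSQL to the host by default, so extra wiring is needed), the community postgres_exporter is one way to feed Prometheus; the DSN below is a placeholder:

docker run -d --name postgres-exporter -p 9187:9187 \
  -e DATA_SOURCE_NAME="postgresql://discourse:PASSWORD@HOST:5432/discourse?sslmode=disable" \
  quay.io/prometheuscommunity/postgres-exporter
# then add HOST:9187 as a scrape target in prometheus.yml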

2 Likes

Thanks for your input @pfaffman :slight_smile: I think db_shared_buffers and db_work_mem in the app.yml are the only controls for PostgreSQL RAM usage, right?

I've tinkered with both, upwards and downwards. Current settings in the app.yml are:
db_shared_buffers: "32768MB"
db_work_mem: "128MB"

Total system RAM is 128GB.

I've also tried changing max_connections in /var/discourse/shared/standalone/postgres_data/postgresql.conf and then rebuilding Discourse. I tried values above the default of 100, from 200 to 500; it's currently set at 300. I'm not sure whether modifying it there actually changes the max connections value, though.
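
One way to confirm whether the edit is taking effect is to ask the running server directly; roughly (this assumes the stock standalone container, where psql can be reached as the postgres user):

cd /var/discourse && ./launcher enter app
su postgres -c 'psql discourse -c "SHOW max_connections;"'   # value PostgreSQL is actually running with
su postgres -c 'psql discourse -c "SHOW shared_buffers;"'    # sanity-check the buffer setting too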

I see these defaults in /var/discourse/templates/postgres.template.yml:

db_synchronous_commit: "off"
db_shared_buffers: "256MB"
db_work_mem: "10MB"
db_default_text_search_config: "pg_catalog.english"
db_name: discourse
db_user: discourse
db_wal_level: minimal
db_max_wal_senders: 0
db_checkpoint_segments: 6
db_logging_collector: off
db_log_min_duration_statement: 100

1 Like

Thanks @bartv. Following your suggestion, I've been watching from inside the Discourse app via top. I'm seeing quite a lot of postmaster processes run by the postgres user, with varying CPU usage. The screenshots below represent extended periods of time with similar usage stats.

Using ~95% of 32 cores:

Using ~20%, with lower CPU usage from postmaster.

Using ~6% CPU, while read-only mode was active.

1 Like

How big is your database? How many users do you have? How many new posts per day?

2 Likes

First thing you should do is run VACUUM ANALYZE; from the postgres console.

This might take a while to run; you might want to stop sidekiq temporarily to lighten the load while it works.

If that doesn’t help, we should enable pg_stat_statements and then check to see what queries are taking a huge amount of CPU.
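
In case it's useful, a rough sketch of both steps from inside the app container (paths and commands assume the stock standalone install; on PostgreSQL 12 and earlier the timing column is total_time rather than total_exec_time):

cd /var/discourse && ./launcher enter app
# the vacuum/analyze pass - can take a while on a large database
su postgres -c 'psql discourse -c "VACUUM ANALYZE;"'

# for pg_stat_statements: add shared_preload_libraries = 'pg_stat_statements'
# to postgresql.conf and restart PostgreSQL, then:
su postgres -c 'psql discourse -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements;"'
su postgres -c 'psql discourse -c "SELECT calls, total_exec_time, query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20;"'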

4 Likes