Very slow Sidekiq issue with large queue due to massive numbers of unread user notifications

markersocial · February 5, 2020 at 12:07

I have an issue with Sidekiq.

It runs through jobs amazingly fast when I monitor it via the Sidekiq web UI. But occasionally it appears to get overwhelmed and starts running extremely slowly, at roughly 1-5% of its normal speed, and it does not recover unless I flush Redis, even though server resource usage is fine/low.

It appears that once the queue hits a certain size, it seizes up and slows down drastically, causing the queue to grow even more. I’m just guessing here though; maybe the queue is only large because processing slowed down for some other reason.

This gif describes what it looks like to me.

[GIF: simpsons_EDIFIL20161024_0001]

There are plenty of server resources available. CPU usage is very low right now, under 10%, and there is plenty of RAM and SSD space available too. The server has 16 CPU cores with 32 threads. I’ve tried running between 8 and 14 unicorn_sidekiqs. I also tried 20, but that created a lot of 5xx errors.

I was able to speed up the slow jobs displayed on the ‘busy’ tab of the Sidekiq web UI by following the topic “Could sidekiq queue be reason for 500 errors?” (adding ‘vm.overcommit_memory = 1’ to the /etc/sysctl.conf file and rebooting) and also by decreasing unicorn_sidekiqs to 8 (from 12).

It’s still running slowly though. I did see this in the Redis log yesterday (the only other warning was about overcommit_memory not being set to 1, which I fixed as described above):

# WARNING: /proc/sys/net/core/somaxconn is set to the lower value of 128

^ Has anyone fixed this warning above?

Anyhow, if anyone has any ideas as to what could be the cause and/or fix - please let me know. I’d appreciate it.

Would be really great to resolve this issue so it doesn’t happen again, rather than flushing.

Here is a screenshot of what I’m seeing on the sidekiq dashboard:

And some screenshots of the jobs under the busy tab:

Also, does anyone know if it is safe to use this option? Deleting the low priority queue from the sidekiq web ui?

markersocial · February 5, 2020 at 17:50

Update: I deleted the low priority queue without issue, however the job processing speed has remained the same.

supermathie · February 5, 2020 at 18:10

Do you have metrics on how long your jobs are taking? This looks like you have a huge amount of contention on your PostAlert jobs, while others are completed quickly.

markersocial · February 5, 2020 at 18:38

Judging from what I’ve observed in the Sidekiq web UI, yes, you’re right: other jobs seem to be completed quickly, with the exception of:

Jobs::PostAlert - 0 to 3 minutes, with the majority being in the 0 to 1 minute range.
Jobs::ProcessPost - 0 to 21 seconds.

Falco · February 5, 2020 at 18:59

Is your SMTP server slow?

markersocial · February 6, 2020 at 04:49

I’m using Amazon SES for sending and have also configured the mail receiver for receiving VERP.

The sending limit displayed on SES is 25 emails/second. Is this too slow? I can probably request it to be increased.

Now that you mention it, I have seen a correlation between this issue starting and days when a larger than normal number of digest emails is sent out (many digest emails were consolidated onto a single day due to a configuration issue in the past).

Stephen · February 6, 2020 at 05:33

How many users are you emailing? What do the mail volumes look like?

markersocial · February 6, 2020 at 07:13

Not sure about how many users are being emailed. The last 30 day active users stat from the admin dashboard is 60.8k, perhaps that is an indicator? Here are the sending stats from SES (100k+ 24hr limit):

markersocial · February 7, 2020 at 09:08

Update: I had the SES per-second sending rate limit increased from 25 to 50, so I can now send at up to 180k emails per hour (although the total allowed per day is just over 100k). The Sidekiq job processing speed doesn’t appear to have improved, however.
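
For reference, the arithmetic behind these SES figures can be sketched as below. The rate is the one quoted in the post; the daily quota is rounded down to 100,000 as an assumption, since the post only says "just over 100k".

```shell
#!/bin/sh
# Figures quoted in the post: 50 emails/second, roughly 100k emails/day.
rate_per_sec=50
daily_quota=100000   # assumption: "just over 100k" rounded down

# Emails per hour if the per-second limit were saturated.
per_hour=$((rate_per_sec * 3600))

# How long full-rate sending could last before the daily quota is used up.
secs_to_quota=$((daily_quota / rate_per_sec))
mins_to_quota=$((secs_to_quota / 60))

echo "max emails per hour at full rate: $per_hour"
echo "minutes of full-rate sending before the daily quota is hit: $mins_to_quota"
```

At the full rate the daily quota would be used up in about half an hour, so on these numbers the daily quota, not the per-second limit, looks like the tighter constraint.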

Falco · February 7, 2020 at 12:43

We had a problem a couple of years ago with users having 10k unread notifications, which would make notification queries slow and in turn make the PostAlert job slow.

We added a protection so it doesn’t happen as much anymore, but it may present different performance characteristics on your setup.

Do you have users set to watch categories who are oblivious to notifications count?

Can you check the max number of unread notifications per user in your database?

markersocial · February 9, 2020 at 14:11

So I cleared out the low priority queue another time and left it for a couple of days (no changes since my last update). It didn’t speed up immediately and queued jobs kept piling up rapidly, but it seems to have fixed itself given some time. Job processing is blazing fast now. Using a 20s polling interval, I’m seeing a range of 55 to 140 jobs per second over the last few minutes. The per-day figures look healthy too, with no queue build-up.

Thanks a lot for the help @Falco @supermathie @Stephen, I really appreciated it!

Regarding your questions, I’m not so sure how to check those. I’d be happy to check (would need some guidance) and provide the info if it’s still helpful though. Something possibly relevant is that I’ve had the ‘max emails per day per user’ setting set to 3 for a long time.
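
One way to check the figure Falco asked about is a query like the one below. This is a sketch, not an official procedure: it assumes the Discourse `notifications` table has a `user_id` column and a boolean `read` column, and that on a standard Docker install a `psql` prompt is reachable with `./launcher enter app` followed by `su postgres -c 'psql discourse'`.

```shell
#!/bin/sh
# Assumption: notifications(user_id, read) as described above.
# Lists the ten users with the most unread notifications.
unread_query='SELECT user_id, COUNT(*) AS unread
FROM notifications
WHERE NOT read
GROUP BY user_id
ORDER BY unread DESC
LIMIT 10;'

# Print the query so it can be pasted into a psql prompt.
printf '%s\n' "$unread_query"
```

If the top rows show counts in the thousands, that would match the situation Falco describes.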

markersocial · February 10, 2020 at 12:04

I may have spoken too soon. Sidekiq jobs are currently running at ~1 to 3 per second, with 8.81 million jobs in the queue.
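
To put those numbers in perspective, here is a rough estimate of how long a queue of that size takes to drain at the rates mentioned in this thread. It assumes no new jobs arrive in the meantime, which is optimistic; `drain_hours` is a hypothetical helper name used only for this sketch.

```shell
#!/bin/sh
# drain_hours QUEUE_SIZE JOBS_PER_SECOND
# Whole hours needed to empty the queue, assuming nothing new is enqueued.
drain_hours() {
  echo $(( $1 / $2 / 3600 ))
}

queue=8810000   # 8.81 million queued jobs, from the post

# 3 jobs/s is the slow rate reported here; 55 and 140 jobs/s are the
# healthy rates reported on February 9.
for rate in 3 55 140; do
  echo "at $rate jobs/s: about $(drain_hours "$queue" "$rate") hours"
done
```

At 3 jobs per second the backlog would take about 815 hours (roughly 34 days) to clear, compared with about 17 to 44 hours at the healthy rates.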

david · February 10, 2020 at 12:29

When did you last update? I added some performance improvements to the PostAlert job a few days ago:

https://github.com/discourse/discourse/commit/db4ae509288340ba30f2ecd84bb13d7cc41dedcb

Some of our very large sites were seeing performance issues for categories with lots of people “watching first post”. This commit has resolved the issue on our hosting, so there’s a chance it could help your site as well.

markersocial · February 10, 2020 at 12:38

Great! I’m updating now, last updated ~10 days ago or so (tests-passed). Will monitor and see if there is an improvement, then report back. Thanks!

markersocial · February 10, 2020 at 13:20

Update: No immediate improvements to speed since updating unfortunately. Will see if it improves with some time.

markersocial · February 11, 2020 at 04:07

Update: Still running slowly and the queue is building up. I see a lot of postmaster processes in ‘top’: ~85% total CPU usage (32 cores), the vast majority of it from postmaster. That is interesting, as earlier today the CPU usage was 20-35% (Sidekiq was moving slowly at that time too). Related topic: “Postmaster eating all CPU”.

markersocial · February 11, 2020 at 09:43

Could these Redis warnings have something to do with it? They are displayed during the app rebuild:

# WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.

# WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.

Has anyone fixed these warnings for a Docker install?

I already added vm.overcommit_memory = 1 to /etc/sysctl.conf to fix the overcommit memory warning.

markersocial · February 11, 2020 at 10:47

So I fixed the Transparent Huge Pages (THP) warning by just running ‘echo never > /sys/kernel/mm/transparent_hugepage/enabled’ as root. I haven’t added it to rc.local for persistence yet; this was just for testing. I did a Discourse rebuild and performance is about the same, maybe slightly improved.

Not so sure how to fix this warning though:
# WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.

I’m seeing some people say that Docker will still use the value 128 even if the system value is set higher, e.g. via a guide like this: Performance tips for Redis Cache Server – Tech and Me
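
Before changing anything, it can help to confirm what value Redis actually sees. The sketch below reads the two kernel settings named in the warnings and compares somaxconn with the backlog of 511 from the log message. `show` and `backlog_ok` are hypothetical helper names. Because somaxconn is tracked per network namespace, the script would need to be run inside the container to see the value that applies to Redis there.

```shell
#!/bin/sh
# Print a kernel setting if its file is readable, otherwise "unavailable".
show() {
  if [ -r "$1" ]; then cat "$1"; else echo "unavailable"; fi
}

# backlog_ok CURRENT_SOMAXCONN WANTED_BACKLOG
# Succeeds when the kernel limit is high enough for the requested backlog.
backlog_ok() {
  [ "$1" -ge "$2" ]
}

somaxconn=$(show /proc/sys/net/core/somaxconn)
echo "somaxconn: $somaxconn"
echo "THP: $(show /sys/kernel/mm/transparent_hugepage/enabled)"

case "$somaxconn" in
  ''|*[!0-9]*)
    echo "could not read somaxconn" ;;
  *)
    if backlog_ok "$somaxconn" 511; then
      echo "a backlog of 511 can be enforced"
    else
      echo "backlog is capped at $somaxconn"
    fi ;;
esac
```

For the THP line, the active setting is the one shown in square brackets, so `[never]` would mean the earlier `echo never` took effect.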

markersocial · February 11, 2020 at 16:02

I’m thinking it could be a good idea to assign some UNICORN_SIDEKIQS specifically to the low priority queue.

It seems like the ‘default’ priority tasks, e.g. PostAlert, are moving quite slowly, and once there is a backlog of these slow default priority tasks, the low priority queue (with tasks that could be completed at a significantly faster rate) balloons, as almost none of them appear to get completed. I suspect that this ballooning makes the overall processing of all tasks slower. I think this could also explain the large fluctuation in jobs per second.

Does anyone know if it’s possible to assign UNICORN_SIDEKIQS in the app.yml file (or some other way) to specific priority tasks?

Falco · February 11, 2020 at 18:50

Adding more Sidekiqs while your database is the bottleneck will only make it worse.

Like I said above, you need to debug the PostgreSQL performance problem.
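
As a possible starting point for that debugging, the query below lists the longest-running active statements from PostgreSQL's standard `pg_stat_activity` view. It is a sketch under assumptions: the truncation to 80 characters and the limit of 10 rows are arbitrary choices, and the query is meant to be pasted into a `psql` session connected to the Discourse database.

```shell
#!/bin/sh
# Longest-running non-idle statements, newest PostgreSQL sessions included.
slow_query="SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC
LIMIT 10;"

# Print the query so it can be pasted into a psql prompt.
printf '%s\n' "$slow_query"
```

If the same notification-related statement keeps appearing at the top while postmaster CPU usage is high, that would point toward the slow notification queries discussed earlier in the thread.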

