Very slow Sidekiq issue with large queue due to massive numbers of unread user notifications

So I have an issue with Sidekiq.

It usually runs through jobs amazingly fast when I monitor it via the Sidekiq web UI. But occasionally it appears to get overwhelmed and starts running extremely slowly, at around 1-5% of its normal speed, and it does not recover unless I flush Redis, despite server resource usage being fine/low.

It appears that once the queue hits a certain size, it seizes up and slows down drastically, causing the queue to grow even more. I'm just guessing here though; maybe the queue is only large because processing slowed down for some other reason.

This gif describes what it looks like to me.


There are plenty of server resources available; CPU usage is very low right now, under 10%, and there is plenty of RAM and SSD space as well. The server has 16 CPU cores with 32 threads. I've tried running between 8 and 14 UNICORN_SIDEKIQS. I also tried 20, but that created a lot of 5xx errors.
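
For reference, I'm changing that worker count via app.yml and rebuilding each time; roughly like this (paths assume the standard discourse_docker layout, so adjust if yours differs):

# edit the env: section of /var/discourse/containers/app.yml, e.g.
#   UNICORN_SIDEKIQS: 8    # number of Sidekiq processes forked alongside unicorn
# then rebuild so the change takes effect:
cd /var/discourse
./launcher rebuild app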

I was able to speed up the slow jobs displayed on the ‘Busy’ tab of the Sidekiq web UI using the advice in Could sidekiq queue be reason for 500 errors? (adding ‘vm.overcommit_memory = 1’ to the /etc/sysctl.conf file and rebooting), and also by decreasing UNICORN_SIDEKIQS down to 8 (from 12).
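
(Side note: the same setting can also be applied on the fly without a reboot; a minimal sketch, assuming standard sysctl:)

# apply immediately:
sysctl -w vm.overcommit_memory=1
# and keep vm.overcommit_memory = 1 in /etc/sysctl.conf so it survives reboots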

It's still running slow though. I did see this in the Redis log yesterday (the only other warning was about not having overcommit_memory set to 1, which I already addressed above):

# WARNING: /proc/sys/net/core/somaxconn is set to the lower value of 128

^ Has anyone fixed this warning above?

Anyhow, if anyone has any ideas as to what could be the cause and/or fix - please let me know. I’d appreciate it.

Would be really great to resolve this issue so it doesn’t happen again, rather than flushing.

Here is a screenshot of what I’m seeing on the sidekiq dashboard:

And some screenshots of the jobs under the busy tab:

Also, does anyone know if it's safe to delete the low priority queue from the Sidekiq web UI?

1 Like

Update: I deleted the low priority queue without issue; however, the job processing speed has remained the same.

1 Like

Do you have metrics on how long your jobs are taking? It looks like you have a huge amount of contention on your PostAlert jobs, but the others are completed quickly.

5 Likes

Judging from what I've observed in the Sidekiq web UI, yes, you're right: other jobs seem to be completed quickly, with the exception of:

Jobs::PostAlert - 0 to 3 minutes, with the majority being in the 0 to 1 minute range.
Jobs::ProcessPost - 0 to 21 seconds.

1 Like

Is your SMTP server slow?

6 Likes

I'm using Amazon SES for sending, and I've also configured the mail receiver (with VERP) for receiving.

The sending limit displayed on SES is 25 emails/second. Is this too slow? I can probably request it to be increased.

Now that you mention it, I have seen a correlation: this issue started on a day when a larger than normal number of digest emails was sent out (many digest emails were consolidated onto a single day due to a past configuration issue).

4 Likes

How many users are you emailing? What do the mail volumes look like?

4 Likes

I'm not sure how many users are being emailed. The ‘active users in the last 30 days’ stat from the admin dashboard is 60.8k; perhaps that's an indicator? Here are the sending stats from SES (100k+ 24-hour limit):

1 Like

Update: I had the SES per-second sending rate limit increased from 25 to 50, so I can now send at a rate of 180k emails per hour (although the total allowed per day is just over 100k). The Sidekiq job processing speed doesn't appear to have improved, however.

4 Likes

We had a problem a couple of years ago where users with 10k unread notifications would make the notification queries slow, which in turn made the PostAlert job slow.

We added a protection so it doesn't happen much anymore, but it may present different performance characteristics on your setup.

Do you have users set to watch categories who are oblivious to their notification counts?

Can you check the max number of unread notifications per user in your database?
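
Something like this should do it from the server (a rough sketch, assuming the standard all-in-one container and the stock notifications table with a boolean read column):

cd /var/discourse
./launcher enter app
# inside the container, open a psql session as the postgres user:
su postgres -c 'psql discourse'
# then, at the psql prompt:
SELECT user_id, count(*) AS unread
FROM notifications
WHERE NOT read
GROUP BY user_id
ORDER BY unread DESC
LIMIT 10;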

9 Likes

So I cleared out the low priority queue another time and left it for a couple of days (no other changes since my last update). It didn't speed up immediately and queued jobs kept piling up rapidly, but it seems to have fixed itself given some time. Job processing is going blazing fast now. :slight_smile: Using a 20s polling interval, I'm seeing a range of 55 to 140 jobs per second over the last few minutes. The per-day figures look healthy too, with no queue build-up.

Thanks a lot for the help @Falco @supermathie @Stephen, I really appreciated it!

Regarding your questions, I'm not sure how to check those. I'd be happy to check (with some guidance) and provide the info if it's still helpful, though. Something possibly relevant: I've had the ‘max emails per day per user’ setting set to 3 for a long time.

4 Likes

I may have spoken too soon. Sidekiq jobs are currently running at ~1 to 3 per second, with an 8.81M queue.

:philosoraptor:

1 Like

When did you last update? I added some performance improvements to the PostAlert job a few days ago:

https://github.com/discourse/discourse/commit/db4ae509288340ba30f2ecd84bb13d7cc41dedcb

Some of our very large sites were seeing performance issues for categories with lots of people “watching first post”. This commit has resolved the issue on our hosting, so there’s a chance it could help your site as well.

6 Likes

Great! I'm updating now; I last updated ~10 days or so ago (tests-passed). I'll monitor to see if there is an improvement, then report back. Thanks!

4 Likes

Update: No immediate improvement in speed since updating, unfortunately. I'll see if it improves with some time.

4 Likes

Update: Still running slowly and the queue is building up. I see a lot of postmaster processes via ‘top’: ~85% total CPU usage (32 cores), the vast majority of it from postmaster. That's interesting, as earlier today the CPU usage was 20-35% (Sidekiq was still moving slowly at that time as well). Related: Postmaster eating all CPU

1 Like

Do you think these Redis warnings could have something to do with it? They are displayed during an app rebuild:

# WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.

# WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.

Has anyone fixed these warnings for a Docker install?

I already added vm.overcommit_memory = 1 to /etc/sysctl.conf to fix the overcommit memory warning.

1 Like

So I fixed the Transparent Huge Pages (THP) warning by just running ‘echo never > /sys/kernel/mm/transparent_hugepage/enabled’ as root. I didn't add it to rc.local for persistence yet, just for testing. I did a Discourse rebuild; performance is about the same, maybe a slight improvement.
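
(For completeness, persisting it looks roughly like this, going by the warning text itself; adjust if your distro doesn't use /etc/rc.local:)

# apply now (as root):
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# persist across reboots by adding the same command to /etc/rc.local (before any 'exit 0'),
# then restart Redis so it picks the setting up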

I'm not so sure how to fix this warning though:
# WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.

I'm seeing some people say that Docker will still use the value 128 even if the host system value is set higher, e.g. when following a guide like this: Performance tips for Redis Cache Server – Tech and Me
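
For reference, this is roughly what I understand the fix to involve; the Docker pass-through part is an assumption on my end, not something I've verified for the Discourse launcher:

# on the host, raise the limit and apply it:
echo 'net.core.somaxconn = 1024' >> /etc/sysctl.conf
sysctl -p
# the container keeps its own namespaced copy of this value, so it may also need to be
# set at the Docker level, e.g. docker run --sysctl net.core.somaxconn=1024 (or the
# equivalent launcher/app.yml pass-through, e.g. a docker_args entry, if supported)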

1 Like

I’m thinking it could be a good idea to assign some UNICORN_SIDEKIQS specifically to the low priority queue.

It seems like the ‘default’ priority tasks, e.g. PostAlert, are moving quite slowly, and once there is a backlog of these slow default priority tasks, the low priority queue (with tasks that could be completed at a significantly faster rate) balloons because almost none of them appear to get completed. I suspect this ballooning makes the overall processing of all tasks slower. I think it could also explain the large fluctuation in jobs per second.

Does anyone know if it's possible to assign UNICORN_SIDEKIQS to specific priority queues in the app.yml file (or some other way)?
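
In plain Sidekiq this kind of split is done by starting processes with different -q flags; a rough sketch of the idea (I don't know whether Discourse's unicorn master exposes this per process via app.yml, hence the question):

# one process that only works the default and critical queues:
bundle exec sidekiq -q default -q critical
# and a separate process dedicated to draining the low priority queue:
bundle exec sidekiq -q low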

Adding more Sidekiqs while your database is a bottleneck will only make it worse.

Like I said above, you need to debug the underlying PostgreSQL performance problem.
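
A first step is simply looking at what PostgreSQL is busy with; something along these lines from inside the container should work (a sketch, assuming the standard all-in-one setup where postgres runs in the app container):

cd /var/discourse
./launcher enter app
# inside the container, open a psql session as the postgres user:
su postgres -c 'psql discourse'
# then, at the psql prompt:
SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;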

10 Likes