Could sidekiq queue be reason for 500 errors?

Upgrading to an 8GB / 4CPU droplet.

Because it’s running on an SSD, is enabling swap a good idea? Thanks for the post; yes, I’ve read that one a few times now.

It’s looking like memory may be the issue:

# free -m
              total        used        free      shared  buff/cache   available
Mem:           3951        2456         132         541        1362         648
Swap:             0           0           0

@gerhard thank you again for your assistance! Upgrading the RAM has got the site running; it feels like such an obvious solution! I was relying on the DO dashboard, which was telling me the server’s RAM was only at 60% capacity. Lesson learned there!

It’s been running for 30 minutes without issue and sidekiq has worked through 20k of the 1.5m queued items!


I’m glad that it’s working now.

Yes, we always recommend enabling swap: see Create a swapfile for your Linux server.
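For reference, that guide boils down to something like this (a sketch only; the 2 GB size is just an example that happens to match the swap shown in the later free -m output):

dd if=/dev/zero of=/swapfile bs=1M count=2048       # create a 2 GB file
chmod 600 /swapfile                                 # swap files must not be world-readable
mkswap /swapfile
swapon /swapfile
echo '/swapfile swap swap auto 0 0' >> /etc/fstab   # keep it across reboots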


This seems extreme now! We’re running an 8 GB / 4 core DO droplet that’s getting no traffic, and the site keeps falling over. Is this likely to be down to the Sidekiq queue? The queue is currently at 3.1 million records and growing; it’s processed 600k in the last 36 hours, with all our outages.

# free -m
              total        used        free      shared  buff/cache   available
Mem:           7983        4728         132        1622        3122        1338
Swap:          2047          30        2017

We’re running with UNICORN_WORKERS set to 8.

At its current processing rate of 10 queued items a second, provided the server stays up :crossed_fingers: we’re looking at a further 86 hours. We imported around 220,000 posts; do these numbers/times seem normal?
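For reference, the 86-hour figure is just the remaining queue divided by the throughput; a quick shell check, assuming roughly 3.1 million items left at 10 items per second:

echo '3100000 / 10 / 3600' | bc   # ≈ 86 hours of processing remaining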

It seems high to me. Did you determine what the newly created jobs are? Do you have a lot of failed jobs? That might point you to an issue too.

Hey @bartv thanks for the reply.

The Critical queue has around 710k items, all of which are similar to:

{"user_id"=>428778, "topic_id"=>433002, "input"=>"topic_notification_level_changed", "current_site_id"=>"default"}

And the Default queue has around 2.4 million items with various arguments:

{"post_id"=>453284, "bypass_bump"=>false, "cooking_options"=>nil, "current_site_id"=>"default"}
{"topic_id"=>434377, "current_site_id"=>"default"}
{"post_id"=>453284, "new_record"=>true, "options"=>{"skip_send_email"=>true}, "current_site_id"=>"default"}

There are only 347 that have failed.

I’m not familiar with the topic_notification_level_changed job. Is that the one that’s keeping your Unicorn workers busy? 10 jobs/second is very slow - I only experienced that when rebaking posts with a lot of graphics in them that needed resizing.

TBH I’m not sure how to check what’s keeping the Unicorn workers busy. We didn’t include any images in the import, so this sounds a little concerning.

Click on the ‘Busy’ tab at the top of the /sidekiq screen and you’ll see your queues and the jobs inside them. Each job also shows how long it’s been active, which is a great indicator of problems.

I assume your Critical queue jobs are getting handled first, but let’s confirm that these are indeed the jobs that are causing the slowness.
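If you’d rather check from the command line, the queue sizes can also be read straight out of Redis inside the container. This is just a sketch, assuming redis-cli is available there (it normally is in the standalone image); Sidekiq keeps each queue as a Redis list named queue:<name>:

cd /var/discourse
./launcher enter app            # shell inside the running app container
redis-cli llen queue:critical   # jobs waiting in the critical queue
redis-cli llen queue:default    # jobs waiting in the default queue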


That was one of those “it can’t be that easy” moments :rofl:

OK, so it looks like things are processing; nothing seems to be obvious in terms of hold-ups:


And now the site’s gone down again :confused:

# free -m
              total        used        free      shared  buff/cache   available
Mem:           7983        4829         128        2116        3025         755
Swap:          2047          72        1975

Looks like you don’t have any seriously slow tasks there… What’s your CPU load like during the processing? If it’s low, you can try increasing UNICORN_SIDEKIQS. It’s currently set to 1 for you; each increase adds 5 job processors at a time.

In contrast, the UNICORN_WORKERS setting affects the number of concurrent web requests that can be handled - this is not related to Sidekiq and increasing the value won’t help solve this issue.

Do you see anything useful in the logs? They’re located in /var/discourse/shared/standalone/log
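For both of those checks, plain Linux commands on the host are enough; something like:

uptime                          # load averages to compare against your 4 cores
top -bn1 | head -20             # one-shot snapshot of CPU and memory per process
tail -f /var/discourse/shared/standalone/log/rails/production.log
tail -f /var/discourse/shared/standalone/log/rails/unicorn.stderr.log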


Thanks again @bartv. I’ve added UNICORN_SIDEKIQS=5 to the app.yml and run ./launcher restart app. Now, looking back at the Sidekiq dashboard, it’s still only processing around 10 per second.

Have I got UNICORN_SIDEKIQS=5 correct, or should it be UNICORN_SIDEKIQS: 5?

The logs are showing thousands of entries for:

Started GET "/sidekiq/stats" for 86.1.10.29 at 2018-06-13 10:34:22 +0000

It should be UNICORN_SIDEKIQS: 5 - the same formatting as any other setting in app.yml. You can verify this by going to the busy tab in Sidekiq again - the number of processes should match the value you entered here.
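In other words, the env section of app.yml should end up looking roughly like this (values taken from earlier in this thread):

env:
  ## ...existing settings left as they are...
  UNICORN_WORKERS: 8
  UNICORN_SIDEKIQS: 5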

And a tip: to quickly update these settings you don’t need to do a full rebuild; just do this:

./launcher destroy app
./launcher start app

OK, so I updated UNICORN_SIDEKIQS to 5 and this temporarily doubled the speed to around 20 per second, until the server fell over again.

# free -m
              total        used        free      shared  buff/cache   available
Mem:           7983        6086         125         971        1771         629
Swap:          2047          46        2001

I’ll try adjusting the number to see if I can get a stable increase without the server bugging out.

I really urge you to inspect your log files after your server crashes; they might provide actionable information.


I see this error tens of thousands of times in /var/discourse/shared/standalone/log/rails/production.log

As well as a very similar message, again thousands of times over, in /var/discourse/shared/standalone/log/rails/unicorn.stderr.log

What does the Redis log say? I had a similar issue with Redis running out of memory; the rebuild log provided the solution to this:

186:M 01 Jun 11:02:31.042 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
186:M 01 Jun 11:02:31.042 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.

Run these commands, then restart your Discourse:

sysctl vm.overcommit_memory=1
echo never > /sys/kernel/mm/transparent_hugepage/enabled

Note that you’ll still need to make these persistent! (See the quoted text above to learn how)
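Concretely, making them persistent looks something like this (taken from the instructions in the quoted warnings):

echo 'vm.overcommit_memory = 1' >> /etc/sysctl.conf
# The THP setting goes in /etc/rc.local; make sure it lands before any final 'exit 0'
# and that rc.local is executable on your system.
echo 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' >> /etc/rc.local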


This seems to have helped! We’re five minutes in now and it’s holding steady at around 20 queued items per second with 5 Unicorn Sidekiqs.

I really appreciate your time on this!


This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.