Could the Sidekiq queue be the reason for 500 errors?

Hmm, I would have expected it to start working again. Something is not right… :thinking:

Sidekiq should be working in the background even though the server returns errors. An increasing number of enqueued jobs could be normal, as long as the count of processed jobs increased overnight as well. Did it?

You could try taking a look at Sidekiq in the rails console:

./launcher enter app
rails c

Sidekiq::Stats.new

You can take a look at API · mperham/sidekiq Wiki · GitHub if you want to know what else you can do with the Sidekiq API.
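
For example, something along these lines in the rails console gives you the numbers to compare (a rough sketch using the standard Sidekiq API - the exact values will obviously differ on your install):

require 'sidekiq/api'

stats = Sidekiq::Stats.new
stats.processed   # total processed jobs - compare this value now and again in a few hours
stats.enqueued    # jobs currently waiting across all queues
stats.failed      # total failed jobs
stats.queues      # per-queue sizes, e.g. {"critical" => 12, "default" => 3456}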

The processed jobs aren’t increasing whilst the server is down. Each time I do a reboot / rebuild the processed jobs increase for the 25-30 seconds the server is up, then they stop. When I do another reboot / rebuild cycle the number of processed jobs is the same as when the server crashed.

When I run

./launcher enter app
rails c

I see the same MISCONF Redis error:

# rails c

bundler: failed to load command: pry (/var/www/discourse/vendor/bundle/ruby/2.4.0/bin/pry)

Redis::CommandError: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Would exporting the Discourse database, building another fresh install, and then importing the data be a sensible route?

Not sure if moving to a fresh server will help. You might run into the same issues, because you will need to rebake all posts.

I’d rather try to temporarily upgrade your server to give it more memory. Did you enable swap on your server? If not, you should. Also, take a look at https://meta.discourse.org/t/tons-of-redis-errors/48402
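
If you want to see what Redis itself is reporting before you resize, a quick check from the rails console might look like this (just a sketch - it assumes the redis gem is available there and that Redis is listening on localhost inside the container):

require 'redis'

redis = Redis.new(host: "localhost")    # assumption: Redis runs inside the same container
info = redis.info("persistence")
info["rdb_last_bgsave_status"]          # "err" while the MISCONF error is being raised
info["rdb_changes_since_last_save"]     # writes Redis has not managed to persist yet
redis.info("memory")["used_memory_human"]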

Upgrading to an 8GB / 4CPU droplet.

Because it’s running on SSD, is enabling swap a good idea? Thanks for the post - yes, I’ve read that one a few times now.

It’s looking like memory may be the issue:

# free -m
              total        used        free      shared  buff/cache   available
Mem:           3951        2456         132         541        1362         648
Swap:             0           0           0

@gerhard thank you again for your assistance! Upgrading the RAM has got the site running - it feels like such an obvious solution! I was relying on the DO dashboard, which was telling me the server’s RAM was only at 60% capacity; lesson learned there!

It’s been running for 30 minutes without issue and sidekiq has worked through 20k of the 1.5m queued items!

I’m glad that it’s working now.

Yes, we always recommend enabling swap. See Create a swapfile for your Linux server.

This seems extreme now! We’re running an 8 GB / 4 core DO droplet that’s getting no traffic and the site keeps falling over. Is this likely to be down to the Sidekiq queue? The queue is currently at 3.1 million records and growing; it’s processed 600k in the last 36 hours with all our outages.

# free -m
              total        used        free      shared  buff/cache   available
Mem:           7983        4728         132        1622        3122        1338
Swap:          2047          30        2017

We’re running with UNICORN_WORKERS set to 8.

At its current processing rate of 10 queued items a second, provided the server stays up :crossed_fingers: we’re looking at a further 86 hours. We imported around 220,000 posts - do these numbers/times seem normal?
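
For reference, the 86 hours is just the remaining queue divided by the observed rate - a quick back-of-envelope check with the rough numbers above:

remaining_jobs = 3_100_000    # approximate current queue size
rate_per_second = 10          # observed processing rate
hours = remaining_jobs / rate_per_second / 3600.0
hours.round(1)                # => 86.1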

It seems high to me. Did you determine what the newly created jobs are? Do you have a lot of failed jobs? That might point you to an issue too.
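
If it helps, here’s a rough sketch of how you could sample the queues from the rails console to see which job classes dominate (standard Sidekiq API; adjust the queue name and sample size as needed):

require 'sidekiq/api'

# tally job classes in a sample of the default queue
counts = Hash.new(0)
Sidekiq::Queue.new("default").first(1000).each { |job| counts[job.klass] += 1 }
counts.sort_by { |_, n| -n }.first(10)

# retries and dead jobs live in separate sets
Sidekiq::RetrySet.new.size
Sidekiq::DeadSet.new.size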

Hey @bartv thanks for the reply.

The Critical queue has around 710k items, all of which are similar to:

{"user_id"=>428778, "topic_id"=>433002, "input"=>"topic_notification_level_changed", "current_site_id"=>"default"}

And the Default queue has around 2.4 million items with various arguments:

{"post_id"=>453284, "bypass_bump"=>false, "cooking_options"=>nil, "current_site_id"=>"default"}
{"topic_id"=>434377, "current_site_id"=>"default"}
{"post_id"=>453284, "new_record"=>true, "options"=>{"skip_send_email"=>true}, "current_site_id"=>"default"}

There’s only 347 that have failed.

I’m not familiar with the topic_notification_level_changed job. Is that the one that’s keeping your Unicorn workers busy? 10 jobs/second is very slow - I only experienced that when rebaking posts with a lot of graphics in them that needed resizing.

TBH I’m not sure how to check what’s keeping the Unicorn workers busy. We didn’t include any images in the import, so this sounds a little concerning.

Click on the ‘Busy’ tab at the top of the /sidekiq screen and you’ll see your queues and the jobs inside them. Each job also shows how long it’s been active, which is a great indicator of problems.

I assume your Critical queue jobs are getting handled first, but let’s confirm that these are indeed the jobs that are causing the slowness.
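
If you prefer the console, something like this should list what’s running right now and since when (a sketch using the standard Sidekiq API; older Sidekiq versions may store the payload as a JSON string, hence the guard):

require 'sidekiq/api'

Sidekiq::Workers.new.each do |_process_id, _thread_id, work|
  payload = work["payload"]
  payload = JSON.parse(payload) if payload.is_a?(String)   # depends on the Sidekiq version
  puts "#{work['queue']}: #{payload['class']} running since #{Time.at(work['run_at'])}"
end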

That was one of those “it can’t be that easy” moments :rofl:

OK, so it looks like things are processing - nothing obvious stands out in terms of hold-ups:

And now the site’s gone down again :confused:

# free -m
              total        used        free      shared  buff/cache   available
Mem:           7983        4829         128        2116        3025         755
Swap:          2047          72        1975

Looks like you don’t have any seriously slow tasks there… What’s your CPU load like during the processing? If it’s low you can try increasing UNICORN_SIDEKIQS. It’s currently set to 1 for you, adding more will add 5 job processors at a time.

In contrast, the UNICORN_WORKERS setting affects the number of concurrent web requests that can be handled - this is not related to Sidekiq and increasing the value won’t help solve this issue.

Do you see anything useful in the logs? They’re located in /var/discourse/shared/standalone/log

Thanks again @bartv. I’ve added UNICORN_SIDEKIQS=5 to the app.yml and run ./launcher restart app; looking back at the sidekiq dashboard, it’s still only processing around 10 per second.

Have I got UNICORN_SIDEKIQS=5 correct, or should it be UNICORN_SIDEKIQS: 5?

The logs are showing thousands of entries for:

Started GET "/sidekiq/stats" for 86.1.10.29 at 2018-06-13 10:34:22 +0000

It should be UNICORN_SIDEKIQS: 5 - the same formatting as any other setting in app.yml. You can verify this by going to the busy tab in Sidekiq again - the number of processes should match the value you entered here.
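
You can also confirm it from the rails console if that’s easier (again just a sketch with the standard Sidekiq API):

require 'sidekiq/api'

processes = Sidekiq::ProcessSet.new
processes.size                                        # should match UNICORN_SIDEKIQS after the restart
processes.sum { |process| process["concurrency"] }    # total job threads across all Sidekiq processes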

And a tip: to quickly update these settings you don’t need to do a full rebuild; just do this:

./launcher destroy app
./launcher start app

OK so I updated the Unicorn Sidekiqs to 5 and this temporarily doubled the speed to around 10 per second, until the server fell over again.

# free -m
              total        used        free      shared  buff/cache   available
Mem:           7983        6086         125         971        1771         629
Swap:          2047          46        2001

I’ll try adjusting the number to see if I can get a stable increase without the server bugging out.

I really urge you to inspect your log files after your server crashes; they might provide actionable information.

I see this error tens of thousands of times in /var/discourse/shared/standalone/log/rails/production.log

As well as a very similar message, again thousands of times over, in /var/discourse/shared/standalone/log/rails/unicorn.stderr.log