Could sidekiq queue be reason for 500 errors?

danmaby · June 11, 2018, 7:06pm

Hi, I’ve completed a migration from phpBB3 to Discourse. For several hours the new install performed flawlessly, but now the server suddenly keeps falling over and returning a 500 error. Each time I need to reboot the server, then rebuild the app to access the site again. The last few times I’ve gone through this process, the site has displayed for just seconds before falling over again.

The sidekiq queue has around 1,350,200 items to process. Would this be a potential cause for the server to fail? It’s running on a 4GB / 2 core DO droplet that’s averaging around 60% RAM usage when the site’s up.

Looking at the logs I see the following error has been logged 8 times:

Sidekiq heartbeat test failed, restarting

/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/logster-1.2.9/lib/logster/logger.rb:93:in `add_with_opts'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/logster-1.2.9/lib/logster/logger.rb:50:in `add'
/usr/local/lib/ruby/2.4.0/logger.rb:534:in `warn'
config/unicorn.conf.rb:182:in `check_sidekiq_heartbeat'
config/unicorn.conf.rb:199:in `master_sleep'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/unicorn-5.4.0/lib/unicorn/http_server.rb:294:in `join'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/unicorn-5.4.0/bin/unicorn:126:in `<top (required)>'
/var/www/discourse/vendor/bundle/ruby/2.4.0/bin/unicorn:23:in `load'
/var/www/discourse/vendor/bundle/ruby/2.4.0/bin/unicorn:23:in `<main>'

Any help appreciated. I feel like I’m just hitting and hoping!

danmaby · June 11, 2018, 8:15pm

The server now fails as soon as it’s rebooted and I run ./launcher rebuild app

It has no traffic going to it and is still only averaging 60% RAM usage. I’m at a loss as to where to start debugging this.

gerhard · June 11, 2018, 9:05pm

You will have to let sidekiq do its job. It can take quite a while to empty the queue, and you might see some problems until the larger jobs have had a chance to complete for the first time. After an import that’s to be expected. Rebooting won’t do you much good. The only thing that helps is more and faster CPUs. Then you could tweak the number of sidekiq workers… otherwise you’ll need lots of patience.

danmaby · June 11, 2018, 9:10pm

Thanks @gerhard (again!). The problem I’m facing, though, is that the site’s down until I run the reboot / rebuild. But then it goes down again almost immediately, so sidekiq can’t run?

gerhard · June 11, 2018, 9:15pm

Let it run for a couple of uninterrupted hours (e.g. overnight). The site should start working again after some jobs have finished in the background. I’ve seen the same behavior multiple times on some of the slower droplets.

danmaby · June 11, 2018, 9:16pm

I’ve run

tail shared/standalone/log/rails/production.log

and

more shared/standalone/log/rails/production.log

I’m seeing thousands of the following error:

Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Would this be related or am I staring at yet another issue? There’s plenty of disk space available:

# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           396M  5.7M  390M   2% /run
/dev/vda1        78G   31G   47G  40% /
tmpfs           2.0G  1.1M  2.0G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
/dev/vda15      105M  3.4M  101M   4% /boot/efi
tmpfs           396M     0  396M   0% /run/user/0
none             78G   31G   47G  40% /var/lib/docker/aufs/mnt/9de2464358c18edd4a230e507ff946ad0254344dcf05dd5d9387cd5e49e71620
shm             512M  8.0K  512M   1% /var/lib/docker/containers/3449b43abc652f92f02b2c3a733f2e439104d4da266335ada9c0e715d68a8e3b/mounts/shm

pfaffman · June 12, 2018, 4:06am

Could be a RAM problem. How many posts and users do you have?

danmaby · June 12, 2018, 6:02am

@pfaffman the RAM is running at 60% capacity on average when the site is up, peaking at around 80%.

danmaby · June 12, 2018, 6:47am

So I’ve let it run overnight and the number of queued jobs has increased from 1,362,215 to 1,397,195. Again, as soon as I reboot / rebuild to be able to view the sidekiq dashboard, the site falls over within seconds and returns 500 errors.

Can the sidekiq worker run if the site is returning 500 errors?

gerhard · June 12, 2018, 9:35am

Hmm, I would have expected it to start working again. Something is not right…

Sidekiq should be working in the background even though the server returns errors. An increasing number of enqueued jobs could be normal, as long as the count of processed jobs increased overnight as well. Did it?

You could try taking a look at Sidekiq in the rails console:

./launcher enter app
rails c

Sidekiq::Stats.new

You can take a look at https://github.com/mperham/sidekiq/wiki/API if you want to know what else you can do with the Sidekiq API.

danmaby · June 12, 2018, 10:08am

The processed jobs aren’t increasing whilst the server is down. Each time I do a reboot / rebuild, the processed jobs increase for the 25-30 seconds the server is up, then they stop. When I do another reboot / rebuild cycle, the number of processed jobs is the same as when the server crashed.

When I run

./launcher enter app
rails c

I see the same MISCONF Redis error:

# rails c

bundler: failed to load command: pry (/var/www/discourse/vendor/bundle/ruby/2.4.0/bin/pry)

Redis::CommandError: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Would exporting the Discourse database and building another fresh install, then importing the data be a sensible route?

gerhard · June 12, 2018, 10:24am

Not sure if moving to a fresh server will help. You might run into the same issues, because you will need to rebake all posts.

I’d rather try to temporarily upgrade your server to give it more memory. Did you enable swap on your server? If not, you should. Also, take a look at https://meta.discourse.org/t/tons-of-redis-errors/48402

danmaby · June 12, 2018, 11:28am

Upgrading to an 8GB / 4CPU droplet.

Because it’s running on SSD, is enabling swap a good idea? Thanks for the post, yes I’ve read that one a few times now.

It’s looking like memory may be the issue:

# free -m
              total        used        free      shared  buff/cache   available
Mem:           3951        2456         132         541        1362         648
Swap:             0           0           0
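
To put the `free -m` numbers above in perspective, here is a minimal Ruby sketch that parses the posted output and compares the `used` and `available` columns as a share of total RAM. The helper name `parse_mem_row` is made up for illustration; the figures are copied from the output above.

```ruby
# Output of `free -m` as posted above (values in MiB).
FREE_OUTPUT = <<~TEXT
                total        used        free      shared  buff/cache   available
  Mem:           3951        2456         132         541        1362         648
  Swap:             0           0           0
TEXT

# Hypothetical helper: turn the "Mem:" row into a hash keyed by the header names.
def parse_mem_row(text)
  rows = text.lines.map(&:split)
  header = rows.first
  mem = rows.find { |cols| cols.first == "Mem:" }
  header.zip(mem.drop(1).map(&:to_i)).to_h
end

mem = parse_mem_row(FREE_OUTPUT)
used_pct = (100.0 * mem["used"] / mem["total"]).round(1)
available_pct = (100.0 * mem["available"] / mem["total"]).round(1)

puts "used:      #{used_pct}% of RAM"
puts "available: #{available_pct}% of RAM"
```

On these figures `used` works out to roughly 62% of RAM, close to the 60% quoted earlier in the thread, while `available` is only about 16%. The `available` column is arguably the more relevant one when a process needs to allocate more memory.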

danmaby · June 12, 2018, 12:12pm

@gerhard thank you again for your assistance! Upgrading the RAM has got the site running, feels like such an obvious solution! I was relying on the DO dashboard, which was telling me the server’s RAM was only at 60% capacity. Lesson learned there!

It’s been running for 30 minutes without issue and sidekiq has worked through 20k of the 1.5m queued items!

gerhard · June 12, 2018, 1:00pm

I’m glad that it’s working now.

Yes, we always recommend enabling swap. See the guide “Create a swapfile for your Linux server”.

danmaby · June 13, 2018, 9:22am

This seems extreme now! We’re running an 8 GB / 4 core DO droplet that’s getting no traffic and the site keeps falling over. Is this likely to be down to the sidekiq queue? The queue is currently at 3.1 million records and growing. It’s processed 600k in the last 36 hours with all our outages.

# free -m
              total        used        free      shared  buff/cache   available
Mem:           7983        4728         132        1622        3122        1338
Swap:          2047          30        2017

We’re running UNICORN_WORKERS 8

At its current processing rate of 10 queued items a second, provided the server stays up, we’re looking at a further 86 hours. We imported around 220,000 posts. Do these numbers/times seem normal?
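
As a rough check on that estimate, the sketch below redoes the arithmetic in Ruby using only the figures quoted in this post (3.1 million queued, 10 jobs per second, 600k processed in 36 hours). `hours_to_drain` is a hypothetical helper, and the calculation assumes no new jobs are enqueued, which the thread suggests is not the case.

```ruby
# Hypothetical helper: hours needed to work through `queued` jobs
# at a steady rate, assuming nothing new is enqueued meanwhile.
def hours_to_drain(queued, jobs_per_second)
  queued / jobs_per_second.to_f / 3600
end

queued = 3_100_000

# Rate seen while the server is up: 10 jobs per second.
best_case = hours_to_drain(queued, 10).round(1)

# Average rate over the last 36 hours, outages included: 600k jobs processed.
observed_rate = 600_000 / (36 * 3600.0)
with_outages = hours_to_drain(queued, observed_rate).round(1)

puts "at 10 jobs/s:               #{best_case} hours"     # about 86.1
puts "at #{observed_rate.round(2)} jobs/s (observed): #{with_outages} hours"  # about 186.0
```

So the 86-hour figure only holds if the server stays up. At the average rate actually achieved over the last 36 hours, the same backlog would take closer to 186 hours, and longer still while the queue keeps growing.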

bartv · June 13, 2018, 9:34am

It seems high to me. Did you determine what the newly created jobs are? Do you have a lot of failed jobs? That might point you to an issue too.
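
One way to find out what the queued jobs are is to tally them by job class. The following is a minimal, self-contained Ruby sketch of that idea. `FakeJob`, `jobs_by_class`, and the `Jobs::ExampleA` / `Jobs::ExampleB` names are all made up for illustration, standing in for real queue entries so the example runs without Sidekiq or Redis.

```ruby
# Hypothetical stand-in for a queued job; only `klass` is used below.
FakeJob = Struct.new(:klass, :args)

# Count jobs per class name and return [name, count] pairs, largest first.
def jobs_by_class(queue)
  queue.each_with_object(Hash.new(0)) { |job, counts| counts[job.klass] += 1 }
       .sort_by { |_klass, count| -count }
end

# Made-up sample data; the class names are placeholders, not real Discourse jobs.
sample = [
  FakeJob.new("Jobs::ExampleA", [{ "post_id" => 1 }]),
  FakeJob.new("Jobs::ExampleB", [{ "topic_id" => 2 }]),
  FakeJob.new("Jobs::ExampleA", [{ "post_id" => 3 }]),
]

jobs_by_class(sample).each { |klass, count| puts "#{klass}: #{count}" }
```

In the rails console, the same method could presumably be given `Sidekiq::Queue.new("default")`, which is enumerable and yields job records that respond to `klass`. With millions of queued jobs a full walk would be slow, so tallying a sample such as `.first(1000)` may be more practical.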

danmaby · June 13, 2018, 9:42am

Hey @bartv thanks for the reply.

The Critical queue has around 710k items, all of which are similar to:

{"user_id"=>428778, "topic_id"=>433002, "input"=>"topic_notification_level_changed", "current_site_id"=>"default"}

And the Default queue has around 2.4 million items with various arguments:

{"post_id"=>453284, "bypass_bump"=>false, "cooking_options"=>nil, "current_site_id"=>"default"}
{"topic_id"=>434377, "current_site_id"=>"default"}
{"post_id"=>453284, "new_record"=>true, "options"=>{"skip_send_email"=>true}, "current_site_id"=>"default"}

There are only 347 that have failed.

bartv · June 13, 2018, 9:57am

I’m not familiar with the topic_notification_level_changed job. Is that the one that’s keeping your Unicorn workers busy? 10 jobs/second is very slow - I only experienced that when rebaking posts with a lot of graphics in them that needed resizing.

danmaby · June 13, 2018, 10:03am

TBH I’m not sure how to check what’s keeping the Unicorn workers busy. We didn’t include any images in the import, so this sounds a little concerning.
