Could the Sidekiq queue be the reason for 500 errors?

Hi, I’ve completed a migration from phpBB3 to Discourse. For several hours the new install performed flawlessly, but now the server keeps falling over and returning a 500 error. Each time I need to reboot the server and then rebuild the app to get access to the site again. The last few times I’ve gone through this process it has displayed the site for just a few seconds before falling over again.

The sidekiq queue has around 1,350,200 items to process. Would this be a potential cause for the server to fail? It’s running on a 4GB / 2 core DO droplet that’s averaging around 60% RAM usage when the site’s up.

Looking at the logs I see the following error has been logged 8 times:

Sidekiq heartbeat test failed, restarting

/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/logster-1.2.9/lib/logster/logger.rb:93:in `add_with_opts'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/logster-1.2.9/lib/logster/logger.rb:50:in `add'
/usr/local/lib/ruby/2.4.0/logger.rb:534:in `warn'
config/unicorn.conf.rb:182:in `check_sidekiq_heartbeat'
config/unicorn.conf.rb:199:in `master_sleep'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/unicorn-5.4.0/lib/unicorn/http_server.rb:294:in `join'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/unicorn-5.4.0/bin/unicorn:126:in `<top (required)>'
/var/www/discourse/vendor/bundle/ruby/2.4.0/bin/unicorn:23:in `load'
/var/www/discourse/vendor/bundle/ruby/2.4.0/bin/unicorn:23:in `<main>'

Any help appreciated :pray: I feel like I’m just hitting and hoping!

The server now fails as soon as it’s rebooted and I run ./launcher rebuild app :frowning:

It has no traffic going to it and is still only averaging 60% RAM usage. I’m at a loss as to where to start debugging this.

You will have to let Sidekiq do its job. It can take quite a while to empty the queue, and you might see some problems until the larger jobs get a chance to complete for the first time. After an import that’s to be expected. Rebooting won’t do you much good. The only thing that really helps is more and faster CPUs. Then you could tweak the number of Sidekiq workers… otherwise you’ll need lots of patience.
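If you do want to experiment with the worker count later, it’s normally set in the env section of containers/app.yml; UNICORN_SIDEKIQS is the variable that the unicorn.conf.rb in your backtrace reads to decide how many Sidekiq processes to fork. A minimal sketch, assuming the standard standalone setup (the values here are only examples):

  env:
    UNICORN_WORKERS: 4     # web (unicorn) workers
    UNICORN_SIDEKIQS: 2    # Sidekiq processes

./launcher rebuild app

On a 2-core droplet there isn’t much headroom for extra processes, though, so this mostly pays off after a CPU upgrade.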

Thanks @gerhard (again!). The problem I’m facing, though, is that the site’s down until I run the reboot / rebuild. But then it goes down almost immediately, so Sidekiq can’t run?

Let it run for a couple of uninterrupted hours (e.g. overnight). The site should start working again after some jobs have finished in the background. I’ve seen the same behavior multiple times on some of the slower droplets.


I’ve run

tail shared/standalone/log/rails/production.log

and

more shared/standalone/log/rails/production.log

I’m seeing thousands of the following error:

Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Would this be related or am I staring at yet another issue? There’s plenty of disk space available:

# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           396M  5.7M  390M   2% /run
/dev/vda1        78G   31G   47G  40% /
tmpfs           2.0G  1.1M  2.0G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
/dev/vda15      105M  3.4M  101M   4% /boot/efi
tmpfs           396M     0  396M   0% /run/user/0
none             78G   31G   47G  40% /var/lib/docker/aufs/mnt/9de2464358c18edd4a230e507ff946ad0254344dcf05dd5d9387cd5e49e71620
shm             512M  8.0K  512M   1% /var/lib/docker/containers/3449b43abc652f92f02b2c3a733f2e439104d4da266335ada9c0e715d68a8e3b/mounts/shm
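As an aside for anyone hitting the same MISCONF error: it means a background RDB save failed and Redis refuses writes until a save succeeds; with this much free disk the usual culprit is the fork for the background save failing under memory pressure rather than the disk itself. Assuming the standard standalone container (where redis-cli is available and Redis listens locally), you can ask Redis directly what went wrong:

./launcher enter app
redis-cli INFO persistence
redis-cli CONFIG GET stop-writes-on-bgsave-error

Look for rdb_last_bgsave_status:err in the INFO output. Flipping stop-writes-on-bgsave-error to no only hides the symptom; the failing save itself still needs fixing.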

Could be a RAM problem. How many posts and users do you have?

@pfaffman the RAM is running at around 60% capacity on average when the site is up, peaking at around 80%.

So I’ve let it run overnight and the number of queued jobs has increased from 1,362,215 to 1,397,195. And again, as soon as I reboot / rebuild to be able to view the Sidekiq dashboard, within seconds the site falls over and returns 500 errors.

Can the sidekiq worker run if the site is returning 500 errors?

Hmm, I would have expected it to start working again. Something is not right… :thinking:

Sidekiq should be working in the background even though the server returns errors. An increasing number of enqueued jobs could be normal, as long as the count of processed jobs increased overnight as well. Did it?

You could try taking a look at Sidekiq in the rails console:

./launcher enter app
rails c

Sidekiq::Stats.new

You can take a look at https://github.com/mperham/sidekiq/wiki/API if you want to know what else you can do with the Sidekiq API.
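For example, something along these lines (standard Sidekiq API calls; the exact numbers and queue names will differ on your install) will show whether jobs are actually being worked off between crashes:

stats = Sidekiq::Stats.new
stats.processed                                   # total processed counter - should keep climbing
stats.enqueued                                    # jobs still waiting across all queues
stats.failed                                      # total failures so far
Sidekiq::Queue.all.map { |q| [q.name, q.size] }   # size of each queue

If stats.processed stays flat while the container is up, Sidekiq is stuck rather than just slow.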


The processed jobs aren’t increasing whilst the server is down. Each time I do a reboot / rebuild the processed jobs increase for the 25-30 seconds the server is up, then they stop. When I do the next reboot / rebuild cycle the number of processed jobs is the same as when the server crashed.

When I run

./launcher enter app
rails c

I see the same MISCONF Redis error:

# rails c

bundler: failed to load command: pry (/var/www/discourse/vendor/bundle/ruby/2.4.0/bin/pry)

Redis::CommandError: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Would exporting the Discourse database, building a fresh install and then importing the data be a sensible route?

Not sure if moving to a fresh server will help. You might run into the same issues, because you will need to rebake all posts.

I’d rather try to temporarily upgrade your server to give it more memory. Did you enable swap on your server? If not, you should. Also, take a look at https://meta.discourse.org/t/tons-of-redis-errors/48402


Upgrading to an 8GB / 4CPU droplet.

Because it’s running on SSD, is enabling swap a good idea? Thanks for the post, yes, I’ve read that one a few times now.

It’s looking like memory may be the issue:

# free -m
              total        used        free      shared  buff/cache   available
Mem:           3951        2456         132         541        1362         648
Swap:             0           0           0

@gerhard thank you again for your assistance! Upgrading the RAM has got the site running; it feels like such an obvious solution now! I was relying on the DO dashboard, which was telling me the server’s RAM was only at 60% capacity. Lesson learned there!

It’s been running for 30 minutes without issue and sidekiq has worked through 20k of the 1.5m queued items!


I’m glad that it’s working now.

Yes, we always recommend enabling swap. Create a swapfile for your Linux server
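The linked guide has the full details, but on a typical Ubuntu droplet the short version looks roughly like this (a sketch, run as root; adjust the 2G to suit your droplet):

fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab

And yes, that includes SSD-backed droplets; the swap is a safety net so Redis and Sidekiq don’t get killed when memory spikes, not something that gets hammered constantly.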


This seems extreme now! We’re running an 8 GB / 4 core DO droplet that’s getting no traffic and the site keeps falling over. Is this likely to be down to the Sidekiq queue? The queue is currently at 3.1 million records and growing; it’s processed 600k in the last 36 hours, with all our outages.

# free -m
              total        used        free      shared  buff/cache   available
Mem:           7983        4728         132        1622        3122        1338
Swap:          2047          30        2017

We’re running with UNICORN_WORKERS set to 8.

At its current processing rate of 10 queued items a second, provided the server stays up :crossed_fingers: we’re looking at a further 86 hours. We imported around 220,000 posts; do these numbers/times seem normal?

It seems high to me. Did you determine what the newly created jobs are? Do you have a lot of failed jobs? That might point you to an issue too.

Hey @bartv thanks for the reply.

The Critical queue has around 710k items, all of which are similar to:

{"user_id"=>428778, "topic_id"=>433002, "input"=>"topic_notification_level_changed", "current_site_id"=>"default"}

And the Default queue has around 2.4 million items with various arguments:

{"post_id"=>453284, "bypass_bump"=>false, "cooking_options"=>nil, "current_site_id"=>"default"}
{"topic_id"=>434377, "current_site_id"=>"default"}
{"post_id"=>453284, "new_record"=>true, "options"=>{"skip_send_email"=>true}, "current_site_id"=>"default"}

There are only 347 that have failed.

I’m not familiar with the topic_notification_level_changed job. Is that the one that’s keeping your Unicorn workers busy? 10 jobs/second is very slow - I only experienced that when rebaking posts with a lot of graphics in them that needed resizing.

TBH I’m not sure how to check what’s keeping the Unicorn workers busy. We didn’t include any images in the import, so this sounds a little concerning.
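For reference, the Sidekiq API linked earlier can show both what’s being worked on right now and which job classes dominate the queue; a rough sketch for the rails console (queue names taken from the Critical/Default queues above, field layout as documented in the Sidekiq API wiki):

./launcher enter app
rails c

require "sidekiq/api"   # usually already loaded in Discourse

# What each Sidekiq thread is busy with right now:
Sidekiq::Workers.new.each do |process_id, thread_id, work|
  puts "#{work["queue"]}: #{work["payload"].inspect}"
end

# Rough breakdown of the first 10,000 jobs sitting in the default queue:
sample = Sidekiq::Queue.new("default").first(10_000)
sample.group_by(&:klass).map { |klass, jobs| [klass, jobs.size] }

That should make it clearer whether the post-processing jobs (the post_id / cooking_options entries above) or the notification-level ones are dominating, and where Sidekiq is actually spending its time.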