Sidekiq CPU load since latest release


(Michael - DiscourseHosting.com) #1

Since updating to the latest release (v0.9.6.2), CPU usage on all our servers went up by 20%.
The cause seems to be Sidekiq.

We upgraded from version v0.9.6.

Top output:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                           
30687 www-data  20   0 1004m 228m 2676 S 26.6 22.9  67:31.56 ruby1.9.1                                                                                                         
    1 root      20   0 24196  476    0 S  0.0  0.0   0:06.13 init                                                                                                              
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.35 kthreadd                                                                                                          
    3 root      20   0     0    0    0 S  0.0  0.0   0:56.23 ksoftirqd/0                                                                                                       
[snip]

   25 root      20   0     0    0    0 S  0.0  0.0   0:00.00 fsnotify_mark                                                                                                     
   26 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ecryptfs-kthrea                                                                                                   
root@test002:~# ps ax|grep 30687
 2584 pts/0    S+     0:00 grep --color=auto 30687
30687 ?        Sl    67:32 sidekiq 2.13.0 discourse [0 of 25 busy]     

Cacti graphs: the host on the bottom left was updated a few days earlier.

Any ideas?


(Ben T) #2

Is something unusual going on with Sidekiq, like a bunch of failed jobs? You can check up on it by going to <your board url>/sidekiq. If you’re using Bluepill, check out this post as well:

http://meta.discourse.org/t/there-is-nothing-about-clockwork-in-official-setup-guide/9086/10
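
For reference, the /sidekiq page mentioned above is just the standard Sidekiq web UI, a small Rack app mounted in the Rails routes. A minimal sketch of how that is typically wired up (not Discourse’s actual routes file, and in production you’d restrict it to admins):

    # config/routes.rb -- minimal sketch of exposing the Sidekiq web UI.
    # Discourse already mounts it for you; shown here only for reference.
    require "sidekiq/web"

    Rails.application.routes.draw do
      mount Sidekiq::Web => "/sidekiq"
    end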


(Michael - DiscourseHosting.com) #3

Sidekiq seems happy as a little kid from a functional point of view: no errors, no sign of problems.

As far as I can see, that other post only has to do with the removal of Clockwork. But yes, I do suspect Sidetiq a bit.


(Ben T) #4

Well, I meant that you should make sure you don’t also have clockwork still running after updating; that’s one of the most recent Sidekiq notes I’ve seen. You should be able to break the processes out further than just “ruby” in top or htop and figure out what’s causing the high load (is it one of the thin sockets?).
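
One way to break the process down to individual threads is to dump a backtrace for every thread and see which one is busy. Newer Sidekiq versions can do something similar when sent the TTIN signal; the sketch below is a hypothetical initializer (not part of Discourse) that does the same thing:

    # Hypothetical initializer: dump all thread backtraces on SIGTTIN,
    # so you can see what the busy thread inside the ruby process is doing.
    Signal.trap("TTIN") do
      Thread.list.each do |thread|
        $stdout.puts "=== thread #{thread.object_id} (#{thread.status}) ==="
        $stdout.puts((thread.backtrace || ["<no backtrace available>"]).join("\n"))
      end
    end

Then send the signal (kill -TTIN <pid>, using the PID you see in top) and check the process output or the Sidekiq log.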


(Michael - DiscourseHosting.com) #5

It’s Sidekiq for sure. See the last lines of the output I pasted before, where I grep for the PID that was at the top of the top output and identify Sidekiq as the culprit.

root@test002:~# ps ax|grep 30687
 2584 pts/0    S+     0:00 grep --color=auto 30687
30687 ?        Sl    67:32 sidekiq 2.13.0 discourse [0 of 25 busy]

After updating, clockwork won’t even run anymore, BTW.


(Stephan) #6

It’s a known problem.


(Sam Saffron) #7

We should definitely move all our schedules off ice_cube; in fact, I would support killing that dependency.


(Michael - DiscourseHosting.com) #8

The known problem states “(it tends to eat up 100% CPU for quite a while).”

On one server, it has been eating 20% CPU for over five days. I guess that’s more than “quite a while”?
I doubt that this is the same problem.


(Sam Saffron) #9

That does not sound like the same issue, but it could be related. We need to debug this (and remove that ice_cube dep concurrently) … @supermathie, are you seeing this issue locally?

Edit: I have a local repro; I will sort this out tomorrow.


(srid) #10

I’m seeing this on DigitalOcean with 0.9.6.2. CPU usage has been consistently hovering at 40% since yesterday when I upgraded Discourse.


(Sam Saffron) #11

I have a complete repro locally: the thread that figures out which recurring job it should schedule next is doing 100 ms of work every second on my super beefy dev box.

I am working on a fix, but it is taking a while because it involves 2 other gems.
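
For context, Sidetiq’s clock asks each schedule for its next occurrence on every tick, and those schedules are ice_cube objects, so the occurrence calculation is repeated once per second. A rough illustration of that per-tick pattern (not Sidetiq’s actual code, just an approximation of the work involved):

    # Rough illustration (hypothetical, not Sidetiq source): every second,
    # ask each ice_cube schedule for its next occurrence.
    require "ice_cube"

    schedules = [
      IceCube::Schedule.new(Time.now) { |s| s.add_recurrence_rule IceCube::Rule.minutely },
      IceCube::Schedule.new(Time.now) { |s| s.add_recurrence_rule IceCube::Rule.daily },
    ]

    loop do
      now = Time.now
      schedules.each do |schedule|
        # next_occurrence walks the recurrence rules; doing this for every
        # scheduled job on every tick adds up to the constant CPU usage above.
        next_run = schedule.next_occurrence(now)
        # (enqueue the corresponding job when next_run is due -- omitted here)
      end
      sleep 1
    end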


(Sam Saffron) #12

I fixed this (temporarily) using:

https://github.com/discourse/discourse/commit/d3c5afbb80ecc44a89dce3ebc868889216677acc

I raised this issue for a long-term solution:

https://github.com/tobiassvn/sidetiq/issues/31


(Sam Saffron) #13

The known problem is not correctly documented: it will chew up the CPU on EVERY run, and a run happens every 1 second by default.
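
If your Sidetiq version exposes a clock resolution setting (an assumption on my part; check the Sidetiq README for your version), raising it is a possible stopgap: the expensive occurrence calculation then runs every few seconds instead of every second, at the cost of less precise scheduling. A minimal sketch:

    # Hypothetical config/initializers/sidetiq.rb -- only if your Sidetiq
    # version supports a resolution option; trades precision for CPU.
    Sidetiq.configure do |config|
      config.resolution = 5 # seconds between clock ticks, instead of 1
    end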


(Michael - DiscourseHosting.com) #14

Fixed :smiley:


(Jeff Atwood) #15

Is this solved @sam?


(Chris Branson) #16

I’m seeing constant 99%+ CPU usage on my Sidekiq process after updating to 0.9.8 today (I also tried master, same problem).

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  997 discours  20   0  927m 154m 9104 S 99.6 15.6  10:41.73 ruby
  795 discours  20   0  203m  17m 1820 S  0.3  1.8   0:01.70 ruby
    1 root      20   0 24344 2176 1292 S  0.0  0.2   0:00.70 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      20   0     0    0    0 S  0.0  0.0   0:00.15 ksoftirqd/0

ps ax|grep 997
997 ?        Sl    12:54 sidekiq 2.15.1 discourse [0 of 25 busy]

(Michael - DiscourseHosting.com) #17

We just found out about this as well. It broke again somewhere between 0.9.7.9 (or more specifically: last Saturday we went to latest on a test server) and 0.9.8.
We’re stopping the server upgrade cycle.

This is what we’re getting in the logs:

2014-01-10T12:07:51Z 17472 TID-6ldog INFO: [Sidetiq] Sidetiq::Supervisor start
2014-01-10T12:07:51Z 17472 TID-k7oy0 INFO: [Sidetiq] Sidetiq::Actor::Clock id: 33952480 initialize
2014-01-10T12:07:51Z 17472 TID-k5gpo INFO: [Sidetiq] Sidetiq::Actor::Handler id: 33849140 initialize
2014-01-10T12:07:51Z 17472 TID-k5974 INFO: [Sidetiq] Sidetiq::Actor::Handler id: 33838740 initialize
2014-01-10T12:59:36Z 21109 TID-fepr4 WARN: [Sidetiq] Can't link Sidetiq::Actor::Clock. Sidekiq::Manager not running. Retrying in 5 seconds ...
2014-01-10T12:59:36Z 21109 TID-fd8vs WARN: [Sidetiq] Can't link Sidetiq::Actor::Handler. Sidekiq::Manager not running. Retrying in 5 seconds ...
2014-01-10T12:59:36Z 21109 TID-fcgj8 WARN: [Sidetiq] Can't link Sidetiq::Actor::Handler. Sidekiq::Manager not running. Retrying in 5 seconds ...

(Michael - DiscourseHosting.com) #18

Our test server is running on a single CPU; there, Sidekiq crashes within 1 minute.

2014-01-10T13:00:59Z 21109 TID-h64a8 WARN: Sidekiq died due to the following error, cannot recover, process exiting
2014-01-10T13:00:59Z 21109 TID-h64a8 WARN: linking timeout of 5 seconds exceeded
2014-01-10T13:00:59Z 21109 TID-h64a8 WARN: /var/www/discourse/vendor/bundle/ruby/2.0.0/gems/celluloid-0.15.2/lib/celluloid/actor.rb:217:in `block (2 levels) in linking_request'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/celluloid-0.15.2/lib/celluloid/actor.rb:203:in `loop'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/celluloid-0.15.2/lib/celluloid/actor.rb:203:in `block in linking_request'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/celluloid-0.15.2/lib/celluloid/tasks.rb:109:in `exclusive'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/celluloid-0.15.2/lib/celluloid.rb:431:in `exclusive'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/celluloid-0.15.2/lib/celluloid/actor.rb:198:in `linking_request'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/celluloid-0.15.2/lib/celluloid/actor.rb:90:in `monitor'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/celluloid-0.15.2/lib/celluloid/actor.rb:101:in `link'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/celluloid-0.15.2/lib/celluloid.rb:383:in `link'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/celluloid-0.15.2/lib/celluloid/calls.rb:25:in `public_send'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/celluloid-0.15.2/lib/celluloid/calls.rb:25:in `dispatch'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/celluloid-0.15.2/lib/celluloid/calls.rb:67:in `dispatch'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/celluloid-0.15.2/lib/celluloid/actor.rb:322:in `block in handle_message'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/celluloid-0.15.2/lib/celluloid/actor.rb:416:in `block in task'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/celluloid-0.15.2/lib/celluloid/tasks.rb:55:in `block in initialize'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/celluloid-0.15.2/lib/celluloid/tasks/task_fiber.rb:13:in `block in create'

On the beefier production servers it does keep running, but I’m not sure whether it’s really working.


(Passante) #19

Same for me.
Hosted on DigitalOcean (1 GB RAM, 30 GB SSD, 1 CPU) with the discourse_docker image by sam.
We are experiencing 100% CPU due to two sidekiq daemons.
Here’s a picture:
http://dropcanvas.com/n0gf1/1

We upgraded to 0.9.8 (but the version before was behaving the same way).
The forum is extremely slow and fails while posting and reloading.


(Michael - DiscourseHosting.com) #20

To be honest, that sounds like a different problem. Two Sidekiq daemons are always trouble, and this definitely was not an issue at 0.9.7.9.

I would start by killing one of those two sidekiq daemons.