Locking issues in Sidekiq jobs?

Twice already we have had cases where Sidekiq was at 100% CPU load. If we strace the two processes shown maxing out the CPUs, we get:

[pid 17688] futex(0x1e9b6e4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x1e9b6e0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
[pid 17688] futex(0x1e9b6b0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b718, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b71c, FUTEX_WAIT_PRIVATE, 759265, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 17688] futex(0x1e9b6b0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17688] futex(0x1e9b6e4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x1e9b6e0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
[pid 17688] futex(0x1e9b6b0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b718, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b71c, FUTEX_WAIT_PRIVATE, 759267, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 17688] futex(0x1e9b6b0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17688] sched_yield()               = 0
[pid 17688] futex(0x1e9b6e4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x1e9b6e0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
[pid 17688] futex(0x1e9b6b0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b718, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b71c, FUTEX_WAIT_PRIVATE, 759269, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 17688] futex(0x1e9b6b0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17688] futex(0x1e9b6e4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x1e9b6e0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
[pid 17688] futex(0x1e9b6b0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b718, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b71c, FUTEX_WAIT_PRIVATE, 759271, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 17688] write(6, "!", 1)            = 1
[pid 17688] futex(0x1e9b6b0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b6e0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b6e4, FUTEX_WAIT_PRIVATE, 2478513, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 17688] futex(0x1e9b6b0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17688] sched_yield()               = 0

Any advice on how to debug the issue?

running: discourse-1.9.0.beta8~git1.3bdade8970

A few updates:

  1. It didn't show any busy jobs while we saw the high CPU usage.
  2. Jobs::PullHotlinkedImages was the only job scheduled when we checked why the CPU usage was so high.

Were you ever able to pinpoint this issue? We’re seeing the exact same thing on one server: Sidekiq goes to 90% CPU and the only way to get it back to normal is to restart it. It then stays low for a day or so before getting stuck at 90% again.

ps -T -p <pid> says this is happening in manager.rb:27, which is here:

https://github.com/discourse/mini_scheduler/blob/master/lib/mini_scheduler/manager.rb#L27

There are no scheduled jobs in Sidekiq.
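
For anyone wanting to double-check this on their own instance, the Sidekiq API can be queried from a Rails console. This is only a minimal sketch (the exact shape of the work hash yielded by Sidekiq::Workers varies between Sidekiq versions):

# Inspect Sidekiq's view of its queues from a Rails console (rails c)
require "sidekiq/api"

puts "busy:      #{Sidekiq::Workers.new.size}"        # jobs currently executing
puts "enqueued:  #{Sidekiq::Queue.all.sum(&:size)}"   # jobs waiting in queues
puts "scheduled: #{Sidekiq::ScheduledSet.new.size}"   # jobs scheduled for later
puts "retries:   #{Sidekiq::RetrySet.new.size}"       # jobs awaiting retry

# List whatever is executing right now; the structure of `work` differs by version
Sidekiq::Workers.new.each do |process_id, thread_id, work|
  puts "#{process_id} #{thread_id} #{work.inspect}"
end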

Use rbtrace to get a better backtrace:

https://samsaffron.com/archive/2018/01/18/my-production-ruby-on-rails-cpu-is-at-100-now-what
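
Roughly speaking, the approach in that post is to attach to the running process with rbtrace and sample it with stackprof. A rough sketch of the Ruby you would evaluate inside the busy Sidekiq process (the PID, sample window, and output path are placeholders):

# Ruby to evaluate inside the busy Sidekiq process, e.g. via
#   bundle exec rbtrace -p <sidekiq_pid> -e 'load "/tmp/cpu_profile.rb"'
# PID, sample window, and output path are placeholders.
require "stackprof"

Thread.new do
  StackProf.start(mode: :cpu, interval: 1000)   # sample the CPU every 1000 microseconds
  sleep 10                                      # collect roughly 10 seconds of samples
  StackProf.stop
  StackProf.results("/tmp/sidekiq-cpu.dump")    # write the raw profile to disk
end

The dump can then be rendered with stackprof /tmp/sidekiq-cpu.dump, which prints a TOTAL/SAMPLES/FRAME table like the one further down.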


I believe better logging of Sidekiq jobs is somewhere on @david’s lower-priority list? I am not sure where that stands, though.

The plan is to add additional logging to track Sidekiq job duration, along with various metrics such as DB time and Redis time.

We’ll also log “currently running jobs” on a regular basis, to help track things down in the event of a crash.

I think it should be ready in 2-3 weeks’ time.
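
For anyone who wants something sooner, per-job duration logging can be prototyped as Sidekiq server middleware. This is only an illustrative sketch, not the planned Discourse implementation (the class name is made up):

# Illustrative Sidekiq server middleware that logs how long each job took.
# Not the planned Discourse implementation; the class name is a placeholder.
class JobDurationLogger
  def call(worker, job, queue)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    Sidekiq.logger.info("#{job['class']} on #{queue} took #{elapsed.round(3)}s")
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add JobDurationLogger
  end
end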


OK, this is what I got. I was hoping to spot the culprit more easily, but maybe @sam knows better how to interpret these results…

==================================
  Mode: cpu(1000)
  Samples: 7994 (0.32% miss rate)
  GC: 10 (0.13%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
      3301  (41.3%)        3300  (41.3%)     ActiveModel::LazyAttributeHash#[]
      2123  (26.6%)        2123  (26.6%)     ActiveRecord::ConnectionAdapters::TransactionState#finalized?
       897  (11.2%)         897  (11.2%)     ActiveRecord::AttributeMethods::PrimaryKey::ClassMethods#primary_key
       787   (9.8%)         786   (9.8%)     ActiveModel::Attribute#value
      7647  (95.7%)         118   (1.5%)     ActiveRecord::AttributeMethods::PrimaryKey#id
      3414  (42.7%)         113   (1.4%)     ActiveModel::AttributeSet#[]
      4400  (55.0%)         104   (1.3%)     ActiveRecord::AttributeMethods::Read#_read_attribute
      4296  (53.7%)          96   (1.2%)     ActiveModel::AttributeSet#fetch_value
      2210  (27.6%)          87   (1.1%)     ActiveRecord::Transactions#update_attributes_from_transaction_state
        55   (0.7%)          55   (0.7%)     Nokogiri::XML::Document#decorate
      7697  (96.3%)          51   (0.6%)     ActiveRecord::Core#==
      2234  (27.9%)          24   (0.3%)     ActiveRecord::Transactions#sync_with_transaction_state
        13   (0.2%)          13   (0.2%)     ActionView::PathResolver#find_template_paths
        12   (0.2%)          12   (0.2%)     PG::Connection#async_exec
        11   (0.1%)          11   (0.1%)     Nokogiri::XML::XPathContext#register_namespaces
        10   (0.1%)          10   (0.1%)     (garbage collection)
         8   (0.1%)           8   (0.1%)     OpenSSL::Buffering#do_write
         5   (0.1%)           5   (0.1%)     Concurrent::Collection::NonConcurrentMapBackend#get_or_default
         5   (0.1%)           5   (0.1%)     Concurrent::Collection::NonConcurrentMapBackend#[]
         4   (0.1%)           4   (0.1%)     Arel::Collectors::PlainString#<<
         4   (0.1%)           4   (0.1%)     ActiveRecord::Relation#initialize_copy
         4   (0.1%)           4   (0.1%)     ActiveSupport::PerThreadRegistry#instance
         3   (0.0%)           3   (0.0%)     #<Module:0x0000000004808f78>.parse
         3   (0.0%)           3   (0.0%)     OpenSSL::SSL::SSLContext#initialize
         3   (0.0%)           3   (0.0%)     #<Module:0x0000000007cbb540>.blank?
         3   (0.0%)           3   (0.0%)     Net::Protocol#ssl_socket_connect
         4   (0.1%)           3   (0.0%)     ActiveModel::Type::Helpers::Numeric#cast
         4   (0.1%)           3   (0.0%)     Nokogiri::XML::Node#write_to
         4   (0.1%)           3   (0.0%)     #<Module:0x00000000063c6648>.reap_connections
         4   (0.1%)           3   (0.0%)     MiniRacer::Context#timeout

This issue went away and then came back, and now we finally seem to have found what was going on.

There was a hanging database query that was being triggered from a regular job cleaning up users. It never got to finish, because after half an hour the job was killed and restarted. When we performed that task manually, the issue was gone.
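
For anyone hitting something similar, one way to spot a query like that is to ask Postgres directly from a Rails console. A minimal sketch (the one-minute threshold is arbitrary):

# From a Rails console: list queries that have been running for over a minute.
# The one-minute threshold is arbitrary; adjust to taste.
rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT pid, now() - query_start AS runtime, state, query
  FROM pg_stat_activity
  WHERE state <> 'idle'
    AND now() - query_start > interval '1 minute'
  ORDER BY runtime DESC
SQL

rows.each { |r| puts "#{r['pid']} #{r['runtime']} #{r['query'][0, 120]}" }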
