Locking issues in Sidekiq jobs?

Twice already we have had cases where Sidekiq was at 100% CPU load. If we strace the two processes shown maxing out the CPUs, we get:

[pid 17688] futex(0x1e9b6e4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x1e9b6e0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
[pid 17688] futex(0x1e9b6b0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b718, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b71c, FUTEX_WAIT_PRIVATE, 759265, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 17688] futex(0x1e9b6b0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17688] futex(0x1e9b6e4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x1e9b6e0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
[pid 17688] futex(0x1e9b6b0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b718, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b71c, FUTEX_WAIT_PRIVATE, 759267, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 17688] futex(0x1e9b6b0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17688] sched_yield()               = 0
[pid 17688] futex(0x1e9b6e4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x1e9b6e0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
[pid 17688] futex(0x1e9b6b0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b718, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b71c, FUTEX_WAIT_PRIVATE, 759269, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 17688] futex(0x1e9b6b0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17688] futex(0x1e9b6e4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x1e9b6e0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
[pid 17688] futex(0x1e9b6b0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b718, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b71c, FUTEX_WAIT_PRIVATE, 759271, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 17688] write(6, "!", 1)            = 1
[pid 17688] futex(0x1e9b6b0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b6e0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17688] futex(0x1e9b6e4, FUTEX_WAIT_PRIVATE, 2478513, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 17688] futex(0x1e9b6b0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 17688] sched_yield()               = 0

Any advice on how to debug the issue?

running: discourse-1.9.0.beta8~git1.3bdade8970

A few updates:

  1. It didn't show any busy jobs while we saw the high CPU usage.
  2. Jobs::PullHotlinkedImages was the only job scheduled when we checked why the CPU usage was so high.

Were you ever able to pinpoint this issue? We’re seeing the exact same thing on one server: Sidekiq goes to 90% CPU and the only way to get it back to normal is to restart it. It then stays low for a day or so before getting stuck at 90% again.

ps -T -p <pid> says this is happening in manager.rb:27, which is here:

https://github.com/discourse/mini_scheduler/blob/master/lib/mini_scheduler/manager.rb#L27

There are no scheduled jobs in Sidekiq.
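
For anyone wanting to double-check this on their own instance, the Sidekiq API can be queried from a Rails console. This is only a minimal sketch (the exact shape of the work hash yielded by Sidekiq::Workers varies between Sidekiq versions):

# Inspect Sidekiq's view of its queues from a Rails console (rails c)
require "sidekiq/api"

puts "busy:      #{Sidekiq::Workers.new.size}"        # jobs currently executing
puts "enqueued:  #{Sidekiq::Queue.all.sum(&:size)}"   # jobs waiting in queues
puts "scheduled: #{Sidekiq::ScheduledSet.new.size}"   # jobs scheduled for later
puts "retries:   #{Sidekiq::RetrySet.new.size}"       # jobs awaiting retry

# List whatever is executing right now; the structure of `work` differs by version
Sidekiq::Workers.new.each do |process_id, thread_id, work|
  puts "#{process_id} #{thread_id} #{work.inspect}"
end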

Use rbtrace to get a better backtrace:

https://samsaffron.com/archive/2018/01/18/my-production-ruby-on-rails-cpu-is-at-100-now-what
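
Roughly speaking, the approach in that post is to attach to the running process with rbtrace and sample it with stackprof. A rough sketch of the Ruby you would evaluate inside the busy Sidekiq process (the PID, sample window, and output path are placeholders):

# Ruby to evaluate inside the busy Sidekiq process, e.g. via
#   bundle exec rbtrace -p <sidekiq_pid> -e 'load "/tmp/cpu_profile.rb"'
# PID, sample window, and output path are placeholders.
require "stackprof"

Thread.new do
  StackProf.start(mode: :cpu, interval: 1000)   # sample the CPU every 1000 microseconds
  sleep 10                                      # collect roughly 10 seconds of samples
  StackProf.stop
  StackProf.results("/tmp/sidekiq-cpu.dump")    # write the raw profile to disk
end

The dump can then be rendered with stackprof /tmp/sidekiq-cpu.dump, which prints a TOTAL/SAMPLES/FRAME table like the one further down.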


I believe better logging of Sidekiq jobs is somewhere on @david’s lower-priority list? I am not sure where that stands, though.

The plan is to add additional logging to track Sidekiq job duration, along with various metrics such as DB time and Redis time.

We’ll also log “currently running jobs” on a regular basis, to help track things down in the event of a crash.

I think it should be ready in 2-3 weeks’ time.
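
For anyone who wants something sooner, per-job duration logging can be prototyped as Sidekiq server middleware. This is only an illustrative sketch, not the planned Discourse implementation (the class name is made up):

# Illustrative Sidekiq server middleware that logs how long each job took.
# Not the planned Discourse implementation; the class name is a placeholder.
class JobDurationLogger
  def call(worker, job, queue)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    Sidekiq.logger.info("#{job['class']} on #{queue} took #{elapsed.round(3)}s")
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add JobDurationLogger
  end
end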


OK, this is what I got. I was hoping to spot the culprit more easily, but maybe @sam knows better how to interpret these results…

==================================
  Mode: cpu(1000)
  Samples: 7994 (0.32% miss rate)
  GC: 10 (0.13%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
      3301  (41.3%)        3300  (41.3%)     ActiveModel::LazyAttributeHash#[]
      2123  (26.6%)        2123  (26.6%)     ActiveRecord::ConnectionAdapters::TransactionState#finalized?
       897  (11.2%)         897  (11.2%)     ActiveRecord::AttributeMethods::PrimaryKey::ClassMethods#primary_key
       787   (9.8%)         786   (9.8%)     ActiveModel::Attribute#value
      7647  (95.7%)         118   (1.5%)     ActiveRecord::AttributeMethods::PrimaryKey#id
      3414  (42.7%)         113   (1.4%)     ActiveModel::AttributeSet#[]
      4400  (55.0%)         104   (1.3%)     ActiveRecord::AttributeMethods::Read#_read_attribute
      4296  (53.7%)          96   (1.2%)     ActiveModel::AttributeSet#fetch_value
      2210  (27.6%)          87   (1.1%)     ActiveRecord::Transactions#update_attributes_from_transaction_state
        55   (0.7%)          55   (0.7%)     Nokogiri::XML::Document#decorate
      7697  (96.3%)          51   (0.6%)     ActiveRecord::Core#==
      2234  (27.9%)          24   (0.3%)     ActiveRecord::Transactions#sync_with_transaction_state
        13   (0.2%)          13   (0.2%)     ActionView::PathResolver#find_template_paths
        12   (0.2%)          12   (0.2%)     PG::Connection#async_exec
        11   (0.1%)          11   (0.1%)     Nokogiri::XML::XPathContext#register_namespaces
        10   (0.1%)          10   (0.1%)     (garbage collection)
         8   (0.1%)           8   (0.1%)     OpenSSL::Buffering#do_write
         5   (0.1%)           5   (0.1%)     Concurrent::Collection::NonConcurrentMapBackend#get_or_default
         5   (0.1%)           5   (0.1%)     Concurrent::Collection::NonConcurrentMapBackend#[]
         4   (0.1%)           4   (0.1%)     Arel::Collectors::PlainString#<<
         4   (0.1%)           4   (0.1%)     ActiveRecord::Relation#initialize_copy
         4   (0.1%)           4   (0.1%)     ActiveSupport::PerThreadRegistry#instance
         3   (0.0%)           3   (0.0%)     #<Module:0x0000000004808f78>.parse
         3   (0.0%)           3   (0.0%)     OpenSSL::SSL::SSLContext#initialize
         3   (0.0%)           3   (0.0%)     #<Module:0x0000000007cbb540>.blank?
         3   (0.0%)           3   (0.0%)     Net::Protocol#ssl_socket_connect
         4   (0.1%)           3   (0.0%)     ActiveModel::Type::Helpers::Numeric#cast
         4   (0.1%)           3   (0.0%)     Nokogiri::XML::Node#write_to
         4   (0.1%)           3   (0.0%)     #<Module:0x00000000063c6648>.reap_connections
         4   (0.1%)           3   (0.0%)     MiniRacer::Context#timeout

This issue went away and then came back, and now we finally seem to have found what was going on.

There was a hanging database query that was being triggered from a regular job cleaning up users. It never got to finish, because after half an hour the job was killed and restarted. When we performed that task manually, the issue was gone.
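
For anyone hitting something similar, one way to spot a query like that is to ask Postgres directly from a Rails console. A minimal sketch (the one-minute threshold is arbitrary):

# From a Rails console: list queries that have been running for over a minute.
# The one-minute threshold is arbitrary; adjust to taste.
rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT pid, now() - query_start AS runtime, state, query
  FROM pg_stat_activity
  WHERE state <> 'idle'
    AND now() - query_start > interval '1 minute'
  ORDER BY runtime DESC
SQL

rows.each { |r| puts "#{r['pid']} #{r['runtime']} #{r['query'][0, 120]}" }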
