Sidekiq hangs (on BotInput job?)

In the past week we have seen three Sidekiq instances on different forums get stuck. There was nothing special going on; Sidekiq was simply not processing any work while the dashboard showed 5 of 5 jobs being processed.

One interesting thing they all had in common was that there was one BotInput job from the critical queue among the jobs being processed. Now this is quite a common job, but it still stands out.

After restarting Sidekiq everything works normally again. Manually queuing a job with the same parameters does not cause it to hang again, and there is nothing special about the specific post it was called for.

Does anyone have any idea how we could track down what is going on here?

1 Like

We have also been having hangs like this, and our host can't figure out what is causing it.

Do you have a screenshot of what you are seeing in the dashboard?

If you can, please try sending the Sidekiq process the TTIN signal and provide the backtrace here.
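For reference, a minimal sketch of how to do that from a Rails console inside the container (the pgrep pattern is an assumption; adjust it to however Sidekiq shows up in your process list). TTIN makes Sidekiq write a backtrace for every thread to its log, which is exactly what we need to see where it is stuck.

sidekiq_pid = `pgrep -f sidekiq`.to_i                  # takes the first matching PID; assumes a single Sidekiq process
Process.kill("TTIN", sidekiq_pid) if sidekiq_pid > 0   # Sidekiq then dumps all thread backtraces to its log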

1 Like

Sorry, took a while before this happened again.

sidekiq-clean.txt (35.8 KB)

Summary of the logs

[default] Thread TID-1ow77 
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/sidekiq-6.5.12/lib/sidekiq/cli.rb:199:in `backtrace'
--
[default] Thread TID-1o1jr 
[default] /var/www/discourse/lib/demon/base.rb:234:in `sleep'
--
[default] Thread TID-1o1j7 
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/connection/ruby.rb:57:in `wait_readable'
--
[default] Thread TID-1o1j3 
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/message_bus-4.3.8/lib/message_bus/timer_thread.rb:130:in `sleep'
--
[default] Thread TID-1o1ij AR Pool Reaper
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/connection_pool/reaper.rb:49:in `sleep'
--
[default] Thread TID-1o1hj 
[default] <internal:thread_sync>:18:in `pop'
--
[default] Thread TID-1o1gz AR Pool Reaper
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/connection_pool/reaper.rb:49:in `sleep'
--
[default] Thread TID-1o1gv 
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/mini_scheduler-0.18.0/lib/mini_scheduler/manager.rb:18:in `sleep'
--
[default] Thread TID-1o1gb 
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/mini_scheduler-0.18.0/lib/mini_scheduler/manager.rb:32:in `sleep'
--
[default] Thread TID-1otmb 
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1otkn 
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1otjz 
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1otif 
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1othr 
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1o1fb 
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/mini_scheduler-0.18.0/lib/mini_scheduler.rb:80:in `sleep'
--
[default] Thread TID-1o1er 
[default] /var/www/discourse/lib/mini_scheduler_long_running_job_logger.rb:87:in `sleep'
--
[default] Thread TID-1o1en heartbeat
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/sidekiq-6.5.12/lib/sidekiq/launcher.rb:76:in `sleep'
--
[default] Thread TID-1o1e3 scheduler
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/connection_pool-2.5.0/lib/connection_pool/timed_stack.rb:79:in `sleep'
--
[default] Thread TID-1ot4n processor
[default] /var/www/discourse/app/models/email_log.rb:58:in `unique_email_per_post'
--
[default] Thread TID-1ot67 processor
[default] /var/www/discourse/app/models/email_log.rb:58:in `unique_email_per_post'
--
[default] Thread TID-1ot8j processor
[default] /var/www/discourse/app/models/email_log.rb:58:in `unique_email_per_post'
--
[default] Thread TID-1ot5n processor
[default] /usr/local/lib/ruby/3.3.0/bundled_gems.rb:74:in `require'
--
[default] Thread TID-1ot6b processor
[default] /var/www/discourse/lib/distributed_mutex.rb:5:in `<main>'
--
[default] Thread TID-1o0kn final-destination_resolver_thread
[default] <internal:thread_sync>:18:in `pop'
--
[default] Thread TID-1o0k3 Timeout stdlib thread
[default] <internal:thread_sync>:18:in `pop'

@tgxworld did you have a chance to look at the backtrace?

I have been having Sidekiq issues since a forum upgrade a month ago. What command do you use to restart Sidekiq? Just a sv restart sidekiq?

Sorry I have not had a chance to take a look yet. Will try and get to it sometime this week.

1 Like

I've been seeing this in the past few days. Eventually all jobs stop running. Previously I rebooted, but is it safe to delete the critical queue? Is it a Redis queue?

I'm up to date at 3.5.0.beta1-dev.

Just a wild guess, but sometimes when I'm chatting with the bot it stops responding, so I refresh the page or give up. Maybe those cases leave a job hanging?

1 Like

These jobs are asynchronous, so they wouldn't even know that you did that.

It's interesting to hear that you are having this on Jobs::BotInput as well. We're seeing this issue on only a small subset of all our servers (a few percent), and it seems to be the instances that use the narrative bot quite heavily.

No, you would lose all the other queued jobs as well.
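To the question of whether it is a Redis queue: yes, Sidekiq queues live in Redis, and you can inspect them without deleting anything. A small sketch using the standard Sidekiq API, run from a Rails console inside the container:

require "sidekiq/api"

queue = Sidekiq::Queue.new("critical")
puts queue.size                        # number of jobs waiting in the Redis-backed critical queue
queue.each { |job| puts job.klass }    # class names of the queued jobs, e.g. Jobs::BotInput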

The easiest and safest way is sv reload unicorn from within the container.

1 Like

That's not the case with our forum. AI is only visible to staff and I've confirmed no staffers are using it.

I've disabled AI for now.

BotInput is a job from the Discourse Narrative Bot (aka Discobot), not the AI bot.

Ah. I have been using the API heavily, as the username discobot.

I had a look at the backtraces and they all point to some problem with the following line:

I'm not exactly sure why that line would cause problems, but it is a line that is not necessary, so I've dropped it in

@RGJ Do you happen to have Rails.application.config.eager_load disabled for some reason? :thinking:
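For context, and as standard Rails behaviour rather than anything Discourse-specific: that flag is normally set per environment, and production defaults to eager loading everything at boot.

# config/environments/production.rb (standard Rails reference, not Discourse's actual file)
Rails.application.configure do
  config.eager_load = true   # load all autoloadable code at boot; false would load constants lazily on first use
end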

2 Likes

Interesting find, thank you for looking into it.
It's hard to tell when such an intermittent problem goes away. I have removed that line on the three instances that hung the most often (one of them almost daily). I will check back in here either:

  • when one of those instances hangs (we then know that this did not do the trick)
  • on Friday if none of them hung (we can then start assuming it was the solution)

Nope, didnā€™t mess with that.

But someone did…

Rails.autoloaders.main.do_not_eager_load(config.root.join("lib"))

at Blaming discourse/config/application.rb at main · discourse/discourse · GitHub
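For anyone following along: with that setting, lib/ stays autoloadable but is excluded from eager loading (standard Zeitwerk behaviour), so even with config.eager_load = true a constant defined under lib/ is only loaded the first time some thread references it, which in Sidekiq's case can be inside a busy worker thread. A sketch with an illustrative constant name:

# illustrative only, not actual Discourse code
Rails.autoloaders.main.do_not_eager_load(Rails.root.join("lib"))

# lib/some_helper.rb defines SomeHelper; the first reference triggers the load
# at runtime, in whichever thread happens to get there first.
SomeHelper.do_work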

3 Likes

@flink91 Do you recall why we do not eager load the lib directory even in production?

1 Like

While the issues have still been occurring this week, they haven't been happening on the three instances where we removed that require line, so I think we can safely assume that this is the culprit :tada:. Thank you for spotting that, @tgxworld, I would never have found that.

Would you be able to backport that fix to stable?

It's related to what's explained here (when we upgraded to Rails 7.1): Upgrading Ruby on Rails — Ruby on Rails Guides

I don't remember the exact problem, but we actually kept the previous behavior, having to require things from lib.
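In practice that means code which needs something from lib/ requires it explicitly rather than relying on eager loading. A sketch with illustrative file and class names (distributed_mutex appears in the backtrace above, but the model here is made up):

# app/models/example_model.rb (illustrative)
require "distributed_mutex"   # explicit require of a lib/ file, i.e. the previous behavior described above

class ExampleModel < ActiveRecord::Base
end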