Sidekiq hangs (on BotInput job?)

In the past week we have seen three Sidekiq instances on different forums get stuck. There was nothing special going on; Sidekiq was simply not processing any work while the dashboard showed 5 of 5 jobs being processed.

One interesting thing they all had in common was that there was one BotInput job from the critical queue among the jobs being processed. Now this is quite a common job, but it still stands out.

After restarting Sidekiq everything works normally again. Manually queuing a job with the same parameters does not cause it to hang again, and there is nothing special about the specific post it was called for.

Does anyone have any idea how we could track down what is going on here?

1 Like

We have also been having hangs like this, and our host can't figure out what is causing it.

Do you have a screenshot of what you are seeing in the dashboard?

If you can, please try sending the Sidekiq process the TTIN signal and provide the backtrace here.
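For reference, a minimal sketch of how to do that from a Rails console inside the container (the pgrep pattern is an assumption; adjust it to however Sidekiq shows up in your process list). TTIN makes Sidekiq write a backtrace for every thread to its log, which is exactly what we need to see where it is stuck.

sidekiq_pid = `pgrep -f sidekiq`.to_i                  # takes the first matching PID; assumes a single Sidekiq process
Process.kill("TTIN", sidekiq_pid) if sidekiq_pid > 0   # Sidekiq then dumps all thread backtraces to its log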

1 Like

Sorry, took a while before this happened again.

sidekiq-clean.txt (35.8 KB)

Summary of the logs

[default] Thread TID-1ow77 
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/sidekiq-6.5.12/lib/sidekiq/cli.rb:199:in `backtrace'
--
[default] Thread TID-1o1jr 
[default] /var/www/discourse/lib/demon/base.rb:234:in `sleep'
--
[default] Thread TID-1o1j7 
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/connection/ruby.rb:57:in `wait_readable'
--
[default] Thread TID-1o1j3 
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/message_bus-4.3.8/lib/message_bus/timer_thread.rb:130:in `sleep'
--
[default] Thread TID-1o1ij AR Pool Reaper
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/connection_pool/reaper.rb:49:in `sleep'
--
[default] Thread TID-1o1hj 
[default] <internal:thread_sync>:18:in `pop'
--
[default] Thread TID-1o1gz AR Pool Reaper
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/connection_pool/reaper.rb:49:in `sleep'
--
[default] Thread TID-1o1gv 
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/mini_scheduler-0.18.0/lib/mini_scheduler/manager.rb:18:in `sleep'
--
[default] Thread TID-1o1gb 
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/mini_scheduler-0.18.0/lib/mini_scheduler/manager.rb:32:in `sleep'
--
[default] Thread TID-1otmb 
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1otkn 
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1otjz 
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1otif 
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1othr 
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1o1fb 
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/mini_scheduler-0.18.0/lib/mini_scheduler.rb:80:in `sleep'
--
[default] Thread TID-1o1er 
[default] /var/www/discourse/lib/mini_scheduler_long_running_job_logger.rb:87:in `sleep'
--
[default] Thread TID-1o1en heartbeat
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/sidekiq-6.5.12/lib/sidekiq/launcher.rb:76:in `sleep'
--
[default] Thread TID-1o1e3 scheduler
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/connection_pool-2.5.0/lib/connection_pool/timed_stack.rb:79:in `sleep'
--
[default] Thread TID-1ot4n processor
[default] /var/www/discourse/app/models/email_log.rb:58:in `unique_email_per_post'
--
[default] Thread TID-1ot67 processor
[default] /var/www/discourse/app/models/email_log.rb:58:in `unique_email_per_post'
--
[default] Thread TID-1ot8j processor
[default] /var/www/discourse/app/models/email_log.rb:58:in `unique_email_per_post'
--
[default] Thread TID-1ot5n processor
[default] /usr/local/lib/ruby/3.3.0/bundled_gems.rb:74:in `require'
--
[default] Thread TID-1ot6b processor
[default] /var/www/discourse/lib/distributed_mutex.rb:5:in `<main>'
--
[default] Thread TID-1o0kn final-destination_resolver_thread
[default] <internal:thread_sync>:18:in `pop'
--
[default] Thread TID-1o0k3 Timeout stdlib thread
[default] <internal:thread_sync>:18:in `pop'

@tgxworld did you have a chance to look at the backtrace?

I have been having Sidekiq issues since a forum upgrade a month ago. What command do you use to restart Sidekiq? Just a sv restart sidekiq?

Sorry I have not had a chance to take a look yet. Will try and get to it sometime this week.

1 Like

I've been seeing this in the past few days. Eventually all jobs stop running. Previously I rebooted, but is it safe to delete the critical queue? Is it a Redis queue?

I'm up to date at 3.5.0.beta1-dev.

Just a wild guess, but sometimes when I'm chatting with the bot it stops responding, so I refresh the page or give up. Maybe those cases leave a job hanging?

1 Like

These jobs are asynchronous, so they wouldn't even know that you did that.

It's interesting to hear that you are having this on Jobs::BotInput as well. We're seeing this issue on only a small subset of all our servers (a few percent), and it seems to be the instances that use the narrative bot quite heavily.

No, you would lose all the other queued jobs as well.
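To the question of whether it is a Redis queue: yes, Sidekiq queues live in Redis, and you can inspect them without deleting anything. A small sketch using the standard Sidekiq API, run from a Rails console inside the container:

require "sidekiq/api"

queue = Sidekiq::Queue.new("critical")
puts queue.size                        # number of jobs waiting in the Redis-backed critical queue
queue.each { |job| puts job.klass }    # class names of the queued jobs, e.g. Jobs::BotInput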

The easiest and safest way is sv reload unicorn from within the container.

1 Like

That's not the case with our forum. AI is only visible to staff and I've confirmed no staffers are using it.

I've disabled AI for now.

BotInput is a job from the Discourse Narrative Bot (aka Discobot), not the AI bot.

Ah. I have been using the API heavily, as the username discobot.

I had a look at the backtraces and they all point to some problem with the following line:

I'm not exactly sure why that line would cause problems, but it is a line that is not necessary, so I've dropped it in

@RGJ Do you happen to have Rails.application.config.eager_load disabled for some reason? :thinking:
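For context, and as standard Rails behaviour rather than anything Discourse-specific: that flag is normally set per environment, and production defaults to eager loading everything at boot.

# config/environments/production.rb (standard Rails reference, not Discourse's actual file)
Rails.application.configure do
  config.eager_load = true   # load all autoloadable code at boot; false would load constants lazily on first use
end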

2 Likes

Interesting find, thank you for looking into it.
It's hard to tell when such an intermittent problem goes away. I have removed that line on the three instances that hung the most often (one of them almost daily). I will check back in here either:

  • when one of those instances hangs (we then know that this did not do the trick)
  • on Friday if none of them hung (we can then start assuming it was the solution)

Nope, didnā€™t mess with that.

But someone did…

Rails.autoloaders.main.do_not_eager_load(config.root.join("lib"))

at Blaming discourse/config/application.rb at main · discourse/discourse · GitHub
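For anyone following along: with that setting, lib/ stays autoloadable but is excluded from eager loading (standard Zeitwerk behaviour), so even with config.eager_load = true a constant defined under lib/ is only loaded the first time some thread references it, which in Sidekiq's case can be inside a busy worker thread. A sketch with an illustrative constant name:

# illustrative only, not actual Discourse code
Rails.autoloaders.main.do_not_eager_load(Rails.root.join("lib"))

# lib/some_helper.rb defines SomeHelper; the first reference triggers the load
# at runtime, in whichever thread happens to get there first.
SomeHelper.do_work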

3 Likes

@flink91 Do you recall why we do not eager load the lib directory even in production?

1 Like

While the issues have still been occurring this week, they haven't been happening on the three instances where we removed that require line, so I think we can safely assume that this is the culprit :tada:. Thank you for spotting that, @tgxworld, I would never have found that.

Would you be able to backport that fix to stable?

It's related to what's explained here (when we upgraded to Rails 7.1): Upgrading Ruby on Rails — Ruby on Rails Guides

I don't remember the exact problem, but we actually kept the previous behavior, having to require things from lib.
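In practice that means code which needs something from lib/ requires it explicitly rather than relying on eager loading. A sketch with illustrative file and class names (distributed_mutex appears in the backtrace above, but the model here is made up):

# app/models/example_model.rb (illustrative)
require "distributed_mutex"   # explicit require of a lib/ file, i.e. the previous behavior described above

class ExampleModel < ActiveRecord::Base
end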