Sidekiq possibly hanging on the BotInput job?

RGJ · February 17, 2025 at 08:21

In the past week we have seen three Sidekiq instances on different forums get stuck. There was nothing special going on; Sidekiq was simply not processing any work while showing 5 of 5 jobs being processed.

One interesting thing they all had in common was that there was one critical BotInput job among the jobs. Now this is quite a common job, but it still stands out.

After restarting Sidekiq everything works normally again. Manually queuing a job with the same parameters does not cause it to hang again. There is nothing special about the specific post it was called for.

Does anyone have any idea how we could track down what is going on here?

Shauny · February 17, 2025 at 21:34

We have also been having hangs like this, and our host can’t figure out what is causing it.

tgxworld · February 18, 2025 at 01:30

Do you have a screenshot of what you are seeing in the dashboard?

If you can, please try sending the Sidekiq process the TTIN signal and provide the backtrace here.
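For anyone unfamiliar with it: on TTIN, Sidekiq logs a backtrace for every live thread in the process. A minimal Ruby sketch of the same idea, using only the standard Thread API, might look like this. The helper name `dump_thread_backtraces` is made up for illustration and is not part of Sidekiq.

```ruby
# Illustrative helper (hypothetical name, not Sidekiq's API): collect a
# backtrace for every live thread in the current process.
def dump_thread_backtraces
  Thread.list.map do |thread|
    label = thread.name || "unnamed"
    # Thread#backtrace returns nil for a thread that has already died.
    frames = thread.backtrace || ["<no backtrace available>"]
    "Thread #{thread.object_id} #{label}\n" + frames.join("\n")
  end.join("\n--\n")
end

# Demo: a worker parked in sleep shows up with `sleep` among its frames.
worker = Thread.new { sleep }
worker.name = "processor"
sleep 0.05 until worker.status == "sleep"

report = dump_thread_backtraces
puts report
worker.kill
```

From a shell, the signal itself is sent with kill -TTIN followed by the PID of the Sidekiq process; the output then appears in the Sidekiq log.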

RGJ · February 20, 2025 at 08:05

Sorry, took a while before this happened again.

sidekiq-clean.txt (35.8 KB)

Summary of the logs

[default] Thread TID-1ow77 
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/sidekiq-6.5.12/lib/sidekiq/cli.rb:199:in `backtrace'
--
[default] Thread TID-1o1jr 
[default] /var/www/discourse/lib/demon/base.rb:234:in `sleep'
--
[default] Thread TID-1o1j7 
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/connection/ruby.rb:57:in `wait_readable'
--
[default] Thread TID-1o1j3 
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/message_bus-4.3.8/lib/message_bus/timer_thread.rb:130:in `sleep'
--
[default] Thread TID-1o1ij AR Pool Reaper
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/connection_pool/reaper.rb:49:in `sleep'
--
[default] Thread TID-1o1hj 
[default] <internal:thread_sync>:18:in `pop'
--
[default] Thread TID-1o1gz AR Pool Reaper
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/connection_pool/reaper.rb:49:in `sleep'
--
[default] Thread TID-1o1gv 
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/mini_scheduler-0.18.0/lib/mini_scheduler/manager.rb:18:in `sleep'
--
[default] Thread TID-1o1gb 
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/mini_scheduler-0.18.0/lib/mini_scheduler/manager.rb:32:in `sleep'
--
[default] Thread TID-1otmb 
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1otkn 
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1otjz 
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1otif 
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1othr 
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1o1fb 
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/mini_scheduler-0.18.0/lib/mini_scheduler.rb:80:in `sleep'
--
[default] Thread TID-1o1er 
[default] /var/www/discourse/lib/mini_scheduler_long_running_job_logger.rb:87:in `sleep'
--
[default] Thread TID-1o1en heartbeat
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/sidekiq-6.5.12/lib/sidekiq/launcher.rb:76:in `sleep'
--
[default] Thread TID-1o1e3 scheduler
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/connection_pool-2.5.0/lib/connection_pool/timed_stack.rb:79:in `sleep'
--
[default] Thread TID-1ot4n processor
[default] /var/www/discourse/app/models/email_log.rb:58:in `unique_email_per_post'
--
[default] Thread TID-1ot67 processor
[default] /var/www/discourse/app/models/email_log.rb:58:in `unique_email_per_post'
--
[default] Thread TID-1ot8j processor
[default] /var/www/discourse/app/models/email_log.rb:58:in `unique_email_per_post'
--
[default] Thread TID-1ot5n processor
[default] /usr/local/lib/ruby/3.3.0/bundled_gems.rb:74:in `require'
--
[default] Thread TID-1ot6b processor
[default] /var/www/discourse/lib/distributed_mutex.rb:5:in `<main>'
--
[default] Thread TID-1o0kn final-destination_resolver_thread
[default] <internal:thread_sync>:18:in `pop'
--
[default] Thread TID-1o0k3 Timeout stdlib thread
[default] <internal:thread_sync>:18:in `pop'
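
To make the pattern in a summary like the one above easier to see, the threads can be grouped by their top frame. This is only a sketch: the helper name `group_by_top_frame` is hypothetical, and the sample input is just the five processor entries copied from the summary.

```ruby
# Sample input: the five "processor" entries from the summary above.
SUMMARY = <<~'LOG'
  [default] Thread TID-1ot4n processor
  [default] /var/www/discourse/app/models/email_log.rb:58:in `unique_email_per_post'
  --
  [default] Thread TID-1ot67 processor
  [default] /var/www/discourse/app/models/email_log.rb:58:in `unique_email_per_post'
  --
  [default] Thread TID-1ot8j processor
  [default] /var/www/discourse/app/models/email_log.rb:58:in `unique_email_per_post'
  --
  [default] Thread TID-1ot5n processor
  [default] /usr/local/lib/ruby/3.3.0/bundled_gems.rb:74:in `require'
  --
  [default] Thread TID-1ot6b processor
  [default] /var/www/discourse/lib/distributed_mutex.rb:5:in `<main>'
LOG

# Hypothetical helper: map each top frame to the threads sitting on it.
def group_by_top_frame(summary)
  groups = Hash.new { |hash, key| hash[key] = [] }
  summary.split(/^--\s*$/).each do |entry|
    # Drop the "[default]" log prefix and any blank lines.
    lines = entry.lines.map { |line| line.sub(/\A\[default\]\s*/, "").strip }
    header, frame = lines.reject(&:empty?)
    next unless header&.start_with?("Thread ") && frame
    groups[frame] << header.sub(/\AThread\s+/, "")
  end
  groups
end

groups = group_by_top_frame(SUMMARY)
groups.each { |frame, threads| puts "#{threads.size}x #{frame}" }
```

For these five entries, three processors share the email_log.rb:58 frame, one is inside require, and one is at the top level of distributed_mutex.rb, so all five workers are accounted for.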

RGJ · March 3, 2025 at 07:57

@tgxworld did you have a chance to look at the backtrace?

Isambard · March 3, 2025 at 09:08

I have been having Sidekiq issues since a forum upgrade a month ago. What command do you use to restart Sidekiq? Just a sv restart sidekiq?

tgxworld · March 5, 2025 at 01:17

Sorry I have not had a chance to take a look yet. Will try and get to it sometime this week.

markschmucker · March 6, 2025 at 19:46

I’ve been seeing this in the past few days. Eventually all jobs stop running. Previously I rebooted, but is it safe to delete the critical queue? Is it a Redis queue?

I’m up-to-date at 3.5.0.beta1-dev.

Just a wild guess, but sometimes when I’m chatting with the bot it stops responding so I refresh the page or give up. Maybe those cases leave a job hanging?

RGJ · March 6, 2025 at 20:21

These jobs are asynchronous, so they wouldn’t even know that you refreshed the page or gave up.

It’s interesting to hear that you are having this on Jobs::BotInput as well. We’re seeing this issue on only a small subset of all our servers (a few percent) and it seems to be the instances that use the narrative bot quite heavily.

No, deleting the critical queue is not safe: you would lose all the other queued jobs as well.

The easiest and safest way to restart Sidekiq is sv reload unicorn from within the container.

markschmucker · March 7, 2025 at 01:39

That’s not the case with our forum. AI is only visible to staff and I’ve confirmed no staffers are using it.

I’ve disabled AI for now.

RGJ · March 7, 2025 at 08:05

BotInput is a job from the Discourse Narrative Bot (aka Discobot), not the AI bot.

markschmucker · March 7, 2025 at 10:45

Ah. I have been using the API heavily, as the username discobot.

tgxworld · March 10, 2025 at 08:36

I had a look at the backtraces and it all points to some problem with the following line:

github.com/discourse/discourse

plugins/discourse-narrative-bot/lib/discourse_narrative_bot/new_user_narrative.rb

85e525a8d

# frozen_string_literal: true

require "distributed_mutex"

module DiscourseNarrativeBot
  class NewUserNarrative < Base
    I18N_KEY = "discourse_narrative_bot.new_user_narrative".freeze
    BADGE_NAME = "Certified".freeze

    TRANSITION_TABLE = {
      begin: {
        init: {
          next_state: :tutorial_bookmark,
          next_instructions:
            Proc.new { I18n.t("#{I18N_KEY}.bookmark.instructions", base_uri: Discourse.base_path) },
          action: :say_hello,
        },
      },
      tutorial_bookmark: {
        next_state: :tutorial_onebox,

I’m not exactly sure why that line would cause problems, but it is not necessary, so I’ve dropped it in

@RGJ Do you happen to have Rails.application.config.eager_load disabled for some reason?
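
As background on why a require at the top of a lazily loaded file can matter when several Sidekiq threads hit it at once: Ruby serializes concurrent require calls for the same file, so the other threads block until the first load finishes. The sketch below only demonstrates that behaviour. It does not reproduce the hang and is not a diagnosis of it; the file name and the global counter are made up for the demo.

```ruby
require "tmpdir"

# Demo-only global: counts how often the body of the required file runs.
$slow_feature_loads = 0

results = Dir.mktmpdir do |dir|
  path = File.join(dir, "slow_feature_demo.rb")
  # The file takes a while to load, so the threads overlap inside require.
  File.write(path, "sleep 0.2\n$slow_feature_loads += 1\n")

  threads = 5.times.map { Thread.new { require path } }
  # Thread#value joins each thread and returns what require returned.
  threads.map(&:value)
end

puts "file body ran #{$slow_feature_loads} time(s)"
puts "require return values: #{results.inspect}"
```

Only one thread actually loads the file; the others wait for it and then get false back because the feature is already loaded.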

RGJ · March 10, 2025 at 08:58

Interesting find, thank you for looking into it.
It’s hard to tell when such an intermittent problem goes away. I have removed that line on the three instances that hung the most often (one of them almost daily). I will check back in here either:

when one of those instances hangs (we then know that this did not do the trick)
on Friday if none of them hung (we can then start assuming it was the solution)

Nope, I didn’t change that setting.

But someone did…

Rails.autoloaders.main.do_not_eager_load(config.root.join("lib"))

in config/application.rb (see the GitHub blame view for discourse/discourse on main).

tgxworld · March 14, 2025 at 01:11

@loic Do you recall why we do not eager load the lib directory even in production?

RGJ · March 14, 2025 at 08:30

While the issues have still been occurring this week, they haven’t happened on the three instances where we removed that require line, so I think we can safely assume that it is the culprit. Thank you for spotting that, @tgxworld. I would never have found it.

Would you be able to backport that fix to stable?

loic · March 14, 2025 at 08:50

It’s related to what’s explained here (when we upgraded to Rails 7.1): Upgrading Ruby on Rails — Ruby on Rails Guides

I don’t remember the exact problem, but we actually kept the previous behavior, having to require things from lib.
