RGJ
(Richard - Communiteq)
February 17, 2025, 8:21am
1
In the past week we have seen three Sidekiq instances on different forums being stuck. There was nothing special going on, it was just that Sidekiq was not processing any work and showing 5 of 5 jobs being processed.
One interesting thing they all had in common was that there was one critical BotInput
job among the jobs. Now this is quite a common job, but it still stands out.
After restarting Sidekiq everything works normal again. Manually queuing a job with the same parameters does not cause it to hang again. There is nothing special with the specific post it was called for.
Does anyone have any idea how we could track down what is going on here?
1 Like
Shauny
(Shaun Robinson)
February 17, 2025, 9:34pm
2
We have also been having hangs like this, and our host canāt figure out what is causing it.
tgxworld
(Alan Tan)
February 18, 2025, 1:30am
3
Do you have a screenshot of what you are seeing in the dashboard?
If you can, please try sending the Sidekiq process the TTIN
signal and provide the backtrace here.
1 Like
RGJ
(Richard - Communiteq)
February 20, 2025, 8:05am
4
Sorry, took a while before this happened again.
sidekiq-clean.txt (35.8 KB)
Summary of the logs
[default] Thread TID-1ow77
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/sidekiq-6.5.12/lib/sidekiq/cli.rb:199:in `backtrace'
--
[default] Thread TID-1o1jr
[default] /var/www/discourse/lib/demon/base.rb:234:in `sleep'
--
[default] Thread TID-1o1j7
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/connection/ruby.rb:57:in `wait_readable'
--
[default] Thread TID-1o1j3
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/message_bus-4.3.8/lib/message_bus/timer_thread.rb:130:in `sleep'
--
[default] Thread TID-1o1ij AR Pool Reaper
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/connection_pool/reaper.rb:49:in `sleep'
--
[default] Thread TID-1o1hj
[default] <internal:thread_sync>:18:in `pop'
--
[default] Thread TID-1o1gz AR Pool Reaper
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/connection_pool/reaper.rb:49:in `sleep'
--
[default] Thread TID-1o1gv
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/mini_scheduler-0.18.0/lib/mini_scheduler/manager.rb:18:in `sleep'
--
[default] Thread TID-1o1gb
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/mini_scheduler-0.18.0/lib/mini_scheduler/manager.rb:32:in `sleep'
--
[default] Thread TID-1otmb
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1otkn
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1otjz
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1otif
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1othr
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1o1fb
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/mini_scheduler-0.18.0/lib/mini_scheduler.rb:80:in `sleep'
--
[default] Thread TID-1o1er
[default] /var/www/discourse/lib/mini_scheduler_long_running_job_logger.rb:87:in `sleep'
--
[default] Thread TID-1o1en heartbeat
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/sidekiq-6.5.12/lib/sidekiq/launcher.rb:76:in `sleep'
--
[default] Thread TID-1o1e3 scheduler
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/connection_pool-2.5.0/lib/connection_pool/timed_stack.rb:79:in `sleep'
--
[default] Thread TID-1ot4n processor
[default] /var/www/discourse/app/models/email_log.rb:58:in `unique_email_per_post'
--
[default] Thread TID-1ot67 processor
[default] /var/www/discourse/app/models/email_log.rb:58:in `unique_email_per_post'
--
[default] Thread TID-1ot8j processor
[default] /var/www/discourse/app/models/email_log.rb:58:in `unique_email_per_post'
--
[default] Thread TID-1ot5n processor
[default] /usr/local/lib/ruby/3.3.0/bundled_gems.rb:74:in `require'
--
[default] Thread TID-1ot6b processor
[default] /var/www/discourse/lib/distributed_mutex.rb:5:in `<main>'
--
[default] Thread TID-1o0kn final-destination_resolver_thread
[default] <internal:thread_sync>:18:in `pop'
--
[default] Thread TID-1o0k3 Timeout stdlib thread
[default] <internal:thread_sync>:18:in `pop'
RGJ
(Richard - Communiteq)
March 3, 2025, 7:57am
5
@tgxworld did you have a chance to look at the backtrace?
Isambard
(Isambard)
March 3, 2025, 9:08am
7
I have been having Sidekiq issues since a forum upgrade a month ago. What command do you use to restart Sidekiq? Just a sv restart sidekiq
?
tgxworld
(Alan Tan)
March 5, 2025, 1:17am
8
Sorry I have not had a chance to take a look yet. Will try and get to it sometime this week.
1 Like
Iām seeing this in the past few days. Eventually all jobs stop running. Previously I rebooted, but is it safe to delete the critical queue? Is it a redis queue?
Iām up-to-date at 3.5.0.beta1-dev.
Just a wild guess, but sometimes when Iām chatting with the bot it stops responding so I refresh the page or give up. Maybe those cases leave a job hanging?
1 Like
RGJ
(Richard - Communiteq)
March 6, 2025, 8:21pm
10
These jobs are asynchronous so they wouldnāt even know that you did that.
Itās interesting to hear that you are having this on Jobs::BotInput
as well. Weāre seeing this issue on only a small subset of all our servers (a few percent) and it seems to be the instances that use the narrative bot quite heavily.
No, you would lose all the other queued jobs as well.
The most easy and safe way is sv reload unicorn
from within the container.
1 Like
Thatās not the case with our forum. AI is only visible to staff and Iāve confirmed no staffers are using it.
Iāve disabled AI for now.
RGJ
(Richard - Communiteq)
March 7, 2025, 8:05am
12
BotInput
is a job from the Discourse Narrative Bot (aka Discobot), not the AI bot.
Ah. I have been using the API heavily, as the username discobot.
tgxworld
(Alan Tan)
March 10, 2025, 8:36am
14
I had a look at the backtraces and it all points to some problem with the following line:
# frozen_string_literal: true
require "distributed_mutex"
module DiscourseNarrativeBot
class NewUserNarrative < Base
I18N_KEY = "discourse_narrative_bot.new_user_narrative".freeze
BADGE_NAME = "Certified".freeze
TRANSITION_TABLE = {
begin: {
init: {
next_state: :tutorial_bookmark,
next_instructions:
Proc.new { I18n.t("#{I18N_KEY}.bookmark.instructions", base_uri: Discourse.base_path) },
action: :say_hello,
},
},
tutorial_bookmark: {
next_state: :tutorial_onebox,
This file has been truncated. show original
Not exactly sure why that line would cause problems though but it is a line that is not necessary so Iāve dropped it in
main
ā remove_unnecessary_require
opened 08:32AM - 10 Mar 25 UTC
`DistributedMutex` is eager loaded in production/test and autoloaded in
developā¦ ment.
@RGJ Do you happen to have Rails.application.config.eager_load
set to disable for some reason?
2 Likes
RGJ
(Richard - Communiteq)
March 10, 2025, 8:58am
16
Interesting find, thank you for looking into it.
Itās hard to tell when such an intermittent problem goes away. I have removed that line on the three instances that hung the most often (one of them almost daily). I will check back in here either:
when one of those instances hangs (we then know that this did not do the trick)
on Friday if none of them hung (we can then start assuming it was the solution)
Nope, didnāt mess with that.
But someone didā¦
Rails.autoloaders.main.do_not_eager_load(config.root.join("lib"))
at Blaming discourse/config/application.rb at main Ā· discourse/discourse Ā· GitHub
3 Likes
tgxworld
(Alan Tan)
March 14, 2025, 1:11am
17
@flink91 Do you recall why we do not eager load the lib
directory even in production?
1 Like
RGJ
(Richard - Communiteq)
March 14, 2025, 8:30am
18
While the issues have been occuring this week, they havenāt been happening on the three instances where we removed that require
line, so I think we can safely assume that this is the culprit . Thank you for spotting that @tgxworld , I would have never found that.
Would you be able to backport that fix to stable?
flink91
(LoĆÆc Guitaut)
March 14, 2025, 8:50am
19
Itās related to whatās explained here (when we upgraded to Rails 7.1): Upgrading Ruby on Rails ā Ruby on Rails Guides
I donāt remember the exact problem, but we actually kept the previous behavior, having to require things from lib
.