RGJ
(Richard - Communiteq)
February 17, 2025, 08:21
1
In the past week we have seen Sidekiq get stuck on three instances on different forums. Nothing special was going on; Sidekiq simply stopped processing work while showing 5 of 5 jobs being processed.
One interesting thing they all had in common was that there was one BotInput job in the critical queue among the running jobs. Now this is quite a common job, but it still stands out.
After restarting Sidekiq, everything works normally again. Manually queuing a job with the same parameters does not cause it to hang again, and there is nothing special about the specific post it was called for.
Does anyone have any idea how we could track down what is going on here?
1 Like
Shauny
(Shaun Robinson)
February 17, 2025, 21:34
2
We have also been having hangs like this, and our host can’t figure out what is causing it.
tgxworld
(Alan Tan)
February 18, 2025, 01:30
3
Do you have a screenshot of what you are seeing in the dashboard?
If you can, please try sending the Sidekiq process the TTIN signal and provide the backtrace here.
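For anyone following along: sending TTIN makes Sidekiq dump a backtrace for every live thread to its log, which is what produces the output shared later in this topic. A minimal sketch of the idea (not Sidekiq's exact code, which lives in cli.rb and goes through its logger):

```ruby
require "stringio"

# Sketch: on TTIN, dump a backtrace for every live thread in the process.
def dump_thread_backtraces(io = $stdout)
  Thread.list.each do |thread|
    io.puts "Thread #{thread.name || thread.object_id}"
    (thread.backtrace || ["<no backtrace available>"]).each do |line|
      io.puts "  #{line}"
    end
  end
end

Signal.trap("TTIN") { dump_thread_backtraces }

# From the shell: kill -TTIN <sidekiq_pid>, then check the Sidekiq logs.
```

Because the handler only reads thread state, it is safe to send TTIN to a stuck production process; the dump shows where each worker thread is blocked.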
1 Like
RGJ
(Richard - Communiteq)
February 20, 2025, 08:05
4
Sorry, took a while before this happened again.
sidekiq-clean.txt (35.8 KB)
Summary of the logs
[default] Thread TID-1ow77
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/sidekiq-6.5.12/lib/sidekiq/cli.rb:199:in `backtrace'
--
[default] Thread TID-1o1jr
[default] /var/www/discourse/lib/demon/base.rb:234:in `sleep'
--
[default] Thread TID-1o1j7
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/connection/ruby.rb:57:in `wait_readable'
--
[default] Thread TID-1o1j3
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/message_bus-4.3.8/lib/message_bus/timer_thread.rb:130:in `sleep'
--
[default] Thread TID-1o1ij AR Pool Reaper
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/connection_pool/reaper.rb:49:in `sleep'
--
[default] Thread TID-1o1hj
[default] <internal:thread_sync>:18:in `pop'
--
[default] Thread TID-1o1gz AR Pool Reaper
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/connection_pool/reaper.rb:49:in `sleep'
--
[default] Thread TID-1o1gv
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/mini_scheduler-0.18.0/lib/mini_scheduler/manager.rb:18:in `sleep'
--
[default] Thread TID-1o1gb
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/mini_scheduler-0.18.0/lib/mini_scheduler/manager.rb:32:in `sleep'
--
[default] Thread TID-1otmb
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1otkn
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1otjz
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1otif
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1othr
[default] /var/www/discourse/app/models/top_topic.rb:8:in `refresh_daily!'
--
[default] Thread TID-1o1fb
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/mini_scheduler-0.18.0/lib/mini_scheduler.rb:80:in `sleep'
--
[default] Thread TID-1o1er
[default] /var/www/discourse/lib/mini_scheduler_long_running_job_logger.rb:87:in `sleep'
--
[default] Thread TID-1o1en heartbeat
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/sidekiq-6.5.12/lib/sidekiq/launcher.rb:76:in `sleep'
--
[default] Thread TID-1o1e3 scheduler
[default] /var/www/discourse/vendor/bundle/ruby/3.3.0/gems/connection_pool-2.5.0/lib/connection_pool/timed_stack.rb:79:in `sleep'
--
[default] Thread TID-1ot4n processor
[default] /var/www/discourse/app/models/email_log.rb:58:in `unique_email_per_post'
--
[default] Thread TID-1ot67 processor
[default] /var/www/discourse/app/models/email_log.rb:58:in `unique_email_per_post'
--
[default] Thread TID-1ot8j processor
[default] /var/www/discourse/app/models/email_log.rb:58:in `unique_email_per_post'
--
[default] Thread TID-1ot5n processor
[default] /usr/local/lib/ruby/3.3.0/bundled_gems.rb:74:in `require'
--
[default] Thread TID-1ot6b processor
[default] /var/www/discourse/lib/distributed_mutex.rb:5:in `<main>'
--
[default] Thread TID-1o0kn final-destination_resolver_thread
[default] <internal:thread_sync>:18:in `pop'
--
[default] Thread TID-1o0k3 Timeout stdlib thread
[default] <internal:thread_sync>:18:in `pop'
RGJ
(Richard - Communiteq)
March 3, 2025, 07:57
5
@tgxworld did you have a chance to look at the backtrace?
Isambard
(Isambard)
March 3, 2025, 09:08
7
I have been having Sidekiq issues since a forum upgrade a month ago. What command do you use to restart Sidekiq? Just a sv restart sidekiq?
tgxworld
(Alan Tan)
March 5, 2025, 01:17
8
Sorry I have not had a chance to take a look yet. Will try and get to it sometime this week.
1 Like
I’ve been seeing this over the past few days. Eventually all jobs stop running. Previously I rebooted, but is it safe to delete the critical queue? Is it a Redis queue?
I’m up to date at 3.5.0.beta1-dev.
Just a wild guess, but sometimes when I’m chatting with the bot it stops responding, so I refresh the page or give up. Maybe those cases leave a job hanging?
1 Like
RGJ
(Richard - Communiteq)
March 6, 2025, 20:21
10
These jobs are asynchronous so they wouldn’t even know that you did that.
It’s interesting to hear that you are having this with Jobs::BotInput as well. We’re seeing this issue on only a small subset of our servers (a few percent), and it seems to be the instances that use the narrative bot quite heavily.
No, you would lose all the other queued jobs as well. The easiest and safest way is sv reload unicorn from within the container.
1 Like
That’s not the case with our forum. AI is only visible to staff and I’ve confirmed no staffers are using it.
I’ve disabled AI for now.
RGJ
(Richard - Communiteq)
March 7, 2025, 08:05
12
BotInput is a job from the Discourse Narrative Bot (aka Discobot), not the AI bot.
Ah. I have been using the API heavily, as the username discobot.
tgxworld
(Alan Tan)
March 10, 2025, 08:36
14
I had a look at the backtraces and they all point to a problem with the following line:
# frozen_string_literal: true
require "distributed_mutex"
module DiscourseNarrativeBot
class NewUserNarrative < Base
I18N_KEY = "discourse_narrative_bot.new_user_narrative".freeze
BADGE_NAME = "Certified".freeze
TRANSITION_TABLE = {
begin: {
init: {
next_state: :tutorial_bookmark,
next_instructions:
Proc.new { I18n.t("#{I18N_KEY}.bookmark.instructions", base_uri: Discourse.base_path) },
action: :say_hello,
},
},
tutorial_bookmark: {
next_state: :tutorial_onebox,
(file excerpt truncated)
Not exactly sure why that line would cause problems, but it is not necessary, so I’ve dropped it in a commit to main (branch remove_unnecessary_require, opened 08:32 AM, March 10, 2025 UTC):
DistributedMutex is eager loaded in production/test and autoloaded in development.
@RGJ Do you happen to have Rails.application.config.eager_load disabled for some reason?
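For context, the point about autoloading is that a constant like DistributedMutex normally loads lazily when first referenced. A minimal sketch of that mechanism using Ruby's built-in autoload, which Rails' Zeitwerk generalizes (the file name and class here are illustrative, not Discourse's actual code):

```ruby
require "tmpdir"

# Sketch of lazy constant loading with Ruby's built-in autoload; Zeitwerk
# implements a richer version of the same idea for Rails applications.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "distributed_mutex.rb"), <<~RUBY)
    class DistributedMutex
      def self.loaded?
        true
      end
    end
  RUBY

  $LOAD_PATH.unshift(dir)

  # Register the constant; the file is not read yet.
  autoload :DistributedMutex, "distributed_mutex"

  # The first reference triggers the require under Ruby's loading lock.
  puts DistributedMutex.loaded?  # prints "true"
end
```

An explicit require "distributed_mutex" at the top of a file bypasses this lazy mechanism, which is part of why the line stood out.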
2 Likes
RGJ
(Richard - Communiteq)
March 10, 2025, 08:58
16
Interesting find, thank you for looking into it.
It’s hard to tell when such an intermittent problem goes away. I have removed that line on the three instances that hung the most often (one of them almost daily). I will check back in here either:
when one of those instances hangs (we then know that this did not do the trick)
on Friday if none of them hung (we can then start assuming it was the solution)
Nope, didn’t mess with that.
But someone did…
Rails.autoloaders.main.do_not_eager_load(config.root.join("lib"))
at Blaming discourse/config/application.rb at main · discourse/discourse · GitHub
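For reference, a sketch of how that setting sits in a Rails application class (the surrounding code here is illustrative, not Discourse's exact file):

```ruby
# config/application.rb (sketch)
module Discourse
  class Application < Rails::Application
    # lib/ stays autoloadable but is excluded from eager loading, so
    # constants defined there are loaded lazily even in production.
    Rails.autoloaders.main.do_not_eager_load(config.root.join("lib"))
  end
end
```

With that exclusion in place, production loads files under lib/ on first reference, just as in development.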
3 Likes
tgxworld
(Alan Tan)
March 14, 2025, 01:11
17
@loic Do you recall why we do not eager load the lib directory even in production?
1 Like
RGJ
(Richard - Communiteq)
March 14, 2025, 08:30
18
While the issues have kept occurring this week, they haven’t been happening on the three instances where we removed that require line, so I think we can safely assume that this was the culprit. Thank you for spotting that @tgxworld, I would never have found it.
Would you be able to backport that fix to stable?
loic
(Loïc Guitaut)
March 14, 2025, 08:50
19
It’s related to what’s explained here (when we upgraded to Rails 7.1): Upgrading Ruby on Rails — Ruby on Rails Guides
I don’t remember the exact problem, but we actually kept the previous behavior, having to require things from lib.