Sidekiq keeps restarting, how to troubleshoot?

Hello,

We have been running a two-container, multi-domain Discourse server for ~4 years, hosting ~20 domains, and regular updates have always gone smoothly. However, at the beginning of October (around Oct. 8-10), probably following a Discourse update, we realized that sign-up emails were no longer being sent. We noticed that Sidekiq jobs do not run and that Sidekiq keeps restarting.

The only difference from our usual upgrades is that this time I had to manually alter all the Postgres databases to activate the latest vector extension; it seems the upgrade script only does this on the main discourse database.

Symptoms:

  1. Logs show that Sidekiq restarts every few seconds

  1. The restart is associated with the following error message:
/var/www/discourse/lib/demon/sidekiq.rb:31:in `heartbeat_check'
config/unicorn.conf.rb:131:in `block (2 levels) in reload'
E, [2025-11-01T11:56:05.989645 #67] ERROR -- : reaped #<Process::Status: pid 6534 SIGKILL (signal 9)> worker=unknown
I, [2025-11-01T11:56:41.468169 #7038]  INFO -- : Loading Sidekiq in process id 7038
W, [2025-11-01T11:57:20.944092 #67]  WARN -- : Process would not terminate cleanly, force quitting. pid: 7038 Demon::Sidekiq
/var/www/discourse/lib/demon/base.rb:94:in `restart'
/var/www/discourse/lib/demon/sidekiq.rb:40:in `block in heartbeat_check'
/var/www/discourse/lib/demon/sidekiq.rb:31:in `each'
/var/www/discourse/lib/demon/sidekiq.rb:31:in `heartbeat_check'
  1. The “sidekiq view” does not seem to process jobs

  1. The UI shows a warning that Sidekiq is not running properly: "A check for updates has not been performed. Ensure Sidekiq is running."

Here is what I tried:

  • Rebuilding (no errors)
  • Flushing the Redis queue (this works: the Sidekiq dashboard goes back to zero, but jobs are still not processed)
  • Checking the Redis version in the data container (7.0.15)
  • Checking whether Sidekiq is paused (it is not)
  • Skimming through the logs in shared/web-only/log; I could not find anything relevant, but extra pointers are welcome!
  • Activating Sidekiq logging by setting DISCOURSE_LOG_SIDEKIQ: 1 in web_only.yml, followed by ./launcher stop web_only && ./launcher destroy web_only && ./launcher start web_only. The log shows only success messages such as:
{"hostname":"forum-web-only","pid":12961,"database":"chatonnade","job_id":null,"job_name":"Jobs::DiscourseAutomation::StalledWikiTracker","job_type":"scheduled","opts":"{}","status":"success","live_slots_start":1298445,"duration":0.04405494895763695,"sql_duration":0.03392060892656446,"sql_calls":1,"redis_duration":0,"redis_calls":0,"net_duration":0,"net_calls":0,"live_slots_finish":1299663,"live_slots":1218,"@timestamp":"2025-11-01T12:17:32.561+00:00"}

I am running out of ideas about what I could do to pinpoint the issue. Where can I look for a meaningful error message?

Many thanks!

I noticed the same thing, but it’s not as frequent. It had been working fine for years, but lately I see it happen about once a month. I even doubled the RAM on the instance last year to keep up with the Discourse upgrades.

Message

Sidekiq heartbeat test failed for 2087, restarting

Backtrace

/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/activesupport-8.0.4/lib/active_support/broadcast_logger.rb:218:in `block in dispatch'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/activesupport-8.0.4/lib/active_support/broadcast_logger.rb:217:in `map'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/activesupport-8.0.4/lib/active_support/broadcast_logger.rb:217:in `dispatch'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/activesupport-8.0.4/lib/active_support/broadcast_logger.rb:129:in `warn'
/var/www/discourse/lib/demon/sidekiq.rb:39:in `block in heartbeat_check'
/var/www/discourse/lib/demon/sidekiq.rb:31:in `each'
/var/www/discourse/lib/demon/sidekiq.rb:31:in `heartbeat_check'
config/unicorn.conf.rb:131:in `block (2 levels) in reload'

Thanks! After hours of digging, my best guess is that Sidekiq restarts when the server is under stress (IO, CPU, RAM), but I could not narrow it down further (nothing in the logs, no OOM events).
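
One way to test the "server under stress" hypothesis without relying on application logs is the Linux pressure stall information (PSI) interface. The sketch below is only a starting point and assumes a Linux host with PSI enabled, so that /proc/pressure/cpu, /proc/pressure/memory and /proc/pressure/io exist; the helper names are made up for illustration. It parses the "some"/"full" lines into numbers so they can be sampled around the time of a restart.

```python
def parse_pressure(text):
    """Parse the contents of a /proc/pressure/* file into nested dicts.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]  # kind is "some" or "full"
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def read_pressure(resource):
    """Read PSI for 'cpu', 'memory' or 'io' (assumes PSI is available)."""
    with open(f"/proc/pressure/{resource}") as f:
        return parse_pressure(f.read())


# Usage on the host (only works where /proc/pressure exists):
# for resource in ("cpu", "memory", "io"):
#     print(resource, read_pressure(resource))
```

Sustained non-zero avg10 or avg60 values for memory or io around the time of a heartbeat failure would support the stress explanation; values near zero would point elsewhere.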

The restart frequency has since dropped from once a minute to once every ~10 minutes (which allows the queue to be processed), and then lower still.
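
To put numbers on the restart frequency rather than estimating it, the "Loading Sidekiq in process id" lines can be extracted from the log and the gaps between them computed. This is a rough Python sketch, assuming the log lines keep the format shown in the first post; the function name is illustrative and the log path is whatever file those lines appear in on your setup.

```python
import re
from datetime import datetime

# Matches lines such as:
# I, [2025-11-01T11:56:41.468169 #7038]  INFO -- : Loading Sidekiq in process id 7038
LOADING = re.compile(
    r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) #\d+\]\s+INFO -- : "
    r"Loading Sidekiq in process id \d+"
)


def restart_intervals(lines):
    """Return the seconds elapsed between consecutive Sidekiq (re)starts."""
    stamps = []
    for line in lines:
        match = LOADING.search(line)
        if match:
            stamps.append(datetime.fromisoformat(match.group(1)))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


# Usage (the path is a placeholder):
# with open("path/to/unicorn-log-file") as f:
#     gaps = restart_intervals(f)
#     print(len(gaps), "restarts; shortest gap:", min(gaps, default=None))
```

Plotting or simply printing these gaps over a day makes it easier to correlate restarts with backups, cron jobs or traffic peaks.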