Hi — let me restate this strictly based on runtime facts from the official Docker container.
What I’m seeing in the running container (facts)
This is an official Docker install with runit (standard /var/discourse launcher workflow; no rebuild right before the incident). Inside the container:
- A runit Sidekiq service exists and is the one being supervised
ls -l /etc/service/sidekiq/run
sv status sidekiq
Output during the incident:
down: sidekiq: 1s, normally up, want up
- Manual Sidekiq start works
cd /var/www/discourse
sudo -u discourse bundle exec sidekiq -C config/sidekiq.yml
This stays up, connects to Redis, and processes jobs.
- Patching only /etc/service/sidekiq/run (no rebuild) fixes the crash loop immediately
Replaced /etc/service/sidekiq/run with:
#!/bin/bash
exec 2>&1
cd /var/www/discourse
mkdir -p tmp/pids
chown discourse:discourse tmp/pids || true
exec chpst -u discourse:discourse \
bash -lc 'cd /var/www/discourse && rm -f tmp/pids/sidekiq*.pid; exec bundle exec sidekiq -C config/sidekiq.yml'
After that:
sv status sidekiq
run: sidekiq: (pid <PID>) <SECONDS>s
So Sidekiq is not being launched via Unicorn master in this image; it’s a runit service whose runtime script can crash-loop.
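If it helps anyone reproduce the check, here is a minimal sketch that classifies one line of sv status output. The function name classify_sv_status is my own (hypothetical) and not part of runit or the image. It only pattern-matches the two output shapes quoted above, and a single "down ... want up" sample is a hint of a crash loop rather than proof, so it is worth sampling a few times.

```shell
# Hypothetical helper: classify one line of `sv status` output.
# Prints "crash-loop" when the service is down although runit wants it up,
# "running" when it is up, and "other" for anything else.
classify_sv_status() {
  case "$1" in
    down:*"want up"*) echo "crash-loop" ;;
    run:*)            echo "running" ;;
    *)                echo "other" ;;
  esac
}

# Usage inside the container:
#   classify_sv_status "$(sv status sidekiq)"
```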
Why you may not see the exact code in discourse_docker
I agree the literal text may not be in the repo because /etc/service/sidekiq/run is a runtime artifact generated/injected during image build/boot, not necessarily a verbatim file in discourse_docker. But it is the active supervised service in this official image, as shown above.
What triggered the fragility (facts + minimal inference)
- We also observed daily logrotate failures caused by the standard Debian permissions on /var/log (root:adm, mode 0775): logrotate refused to rotate until we added a global "su root adm" directive.
- When logrotate was failing, it recreated files under /shared/log/rails/, including sidekiq.log.
- The default runit script in this image used discourse:www-data and forced -L log/sidekiq.log into /shared/log, which makes Sidekiq very sensitive to permission drift on the shared volume and can cause an immediate exit before any useful logs are written.
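To compare ownership and modes on other installs, something like the sketch below may be useful. It assumes GNU stat (which the Debian-based container should have), and show_perms is a hypothetical name of mine. Note that stat prints the mode without the leading zero, so 0775 shows up as 775.

```shell
# Hypothetical helper: print "owner:group mode path" for each argument.
# Relies on GNU stat's -c format option.
show_perms() {
  stat -c '%U:%G %a %n' "$@"
}

# Usage inside the container, with the paths mentioned above:
#   show_perms /var/log /shared/log/rails /shared/log/rails/sidekiq.log
```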
Request / proposal
Given the above, could we consider hardening the default Docker/runit Sidekiq service?
Suggested defaults:
- run as discourse:discourse (matches typical ownership inside container),
- start via bundle exec sidekiq -C config/sidekiq.yml,
- avoid forcing a shared -L log/sidekiq.log (or make it resilient).
This would prevent the silent down: 1s crash loop that stops all background/AI jobs.
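As one possible reading of "make it resilient", here is a rough sketch rather than a tested patch for the image. The idea is to use the log file only if it can actually be opened for appending, and otherwise leave output on stdout so the service still starts. pick_log_file is a hypothetical helper of mine; it uses plain shell redirection and no Sidekiq logging flag.

```shell
# Hypothetical helper: print the preferred log file if the current user
# can open it for appending (creating it if needed), else print nothing.
pick_log_file() {
  # ":" writes nothing; the redirection alone tests whether the open works.
  if ( : >>"$1" ) 2>/dev/null; then
    echo "$1"
  fi
}

# Possible use in a run script (sketch; it must run as the service user
# so that the check reflects that user's permissions):
#   log_file="$(pick_log_file log/sidekiq.log)"
#   if [ -n "$log_file" ]; then exec >>"$log_file" 2>&1; fi
#   exec bundle exec sidekiq -C config/sidekiq.yml
```

With this shape, a permission problem on the shared volume costs you the dedicated log file, not the whole service.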
Happy to test any branch/commit you point me at.