Sidekiq runit script too fragile

Hi team,

Reporting a failure mode in the official Docker/runit setup that can silently kill Sidekiq (and therefore AI / background jobs) without any rebuild or upgrade.

Environment

  • Official Discourse Docker install (standard container + runit services).
  • No rebuild/upgrade right before the issue started.
  • Discourse AI plugin enabled, but AI stopped replying.

Symptoms

  • AI looks enabled in admin UI, but no AI replies appear.
  • Background jobs (AI/embeddings/auto-reply) appear stuck.
  • sv status sidekiq shows Sidekiq repeatedly dying right after start:
down: sidekiq: 1s, normally up, want up
  • Manually starting Sidekiq works fine, so the app itself is OK:
bundle exec sidekiq -C config/sidekiq.yml
# stays up, connects to Redis, processes jobs
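
When the supervised service flaps like this but a manual start works, it helps to look at exactly what runit is executing. A minimal check, assuming the standard /etc/service layout inside the container:

# What is runit actually running, and in what state?
sv status sidekiq
cat /etc/service/sidekiq/run
# If the service has an attached svlogd logger, its output may hold the
# real error; log/run shows where svlogd is writing:
cat /etc/service/sidekiq/log/run 2>/dev/null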

What we found

The default runit script was:

exec chpst -u discourse:www-data \
  bash -lc 'cd /var/www/discourse && ... bundle exec sidekiq -e production -L log/sidekiq.log'

Two fragility points:

  1. Primary group www-data. In my container, the typical writable paths are owned by discourse:discourse. Any ownership drift in tmp/pids or the shared paths can make Sidekiq exit during boot when run with the www-data group, even though a manual start as discourse works.
  2. Forced -L log/sidekiq.log writing to shared logs. The log path is a symlink into /shared/log/rails/sidekiq.log. If that file or directory gets recreated with different ownership/permissions, Sidekiq can exit immediately, before producing any useful logs.
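
Both fragility points can be checked from inside the container; a quick sketch, assuming the paths above (the shared-volume layout may differ per install):

# Who owns the pid and log locations Sidekiq needs at boot?
ls -ld /var/www/discourse/tmp/pids
ls -lL /var/www/discourse/log/sidekiq.log   # -L follows the symlink into /shared
# Can the service's user/group actually write there? chpst runs the test
# exactly as the runit script would:
chpst -u discourse:www-data test -w /var/www/discourse/tmp/pids \
  && echo writable || echo not-writable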

Related trigger: logrotate failing daily

Separately, logrotate was failing every day with:

error: skipping "...log" because parent directory has insecure permissions
Set "su" directive in config file ...

The cause was standard Debian/Ubuntu permissions:

  • /var/log is root:adm with 0775 (group writable).
  • logrotate refuses rotation unless a global su directive is set. This is expected upstream behavior.
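
The condition logrotate objects to is easy to confirm (Debian/Ubuntu default shown; output may differ on other bases):

# /var/log is group-writable by adm, which logrotate treats as insecure
# unless a su directive tells it which user/group to switch to:
stat -c '%U:%G %a' /var/log
# typical output on a stock container: root:adm 775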

At the moment the daily logrotate job failed, it also recreated files under /shared/log/rails/ (including sidekiq.log), which likely interacted with the forced -L logging and contributed to the Sidekiq “1s crash” loop.

Fix (no rebuild needed)

  1. Fix logrotate so it stops touching shared logs while in a failed state. Add a global su directive:
# /etc/logrotate.conf (top)
su root adm

After that, logrotate -v exits 0 and no longer reports insecure parent perms.
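
A debug run is a safe way to confirm this before the next daily job, since -d parses the config and simulates rotation without touching any files:

logrotate -d /etc/logrotate.conf && echo "config OK"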

  2. Replace the Sidekiq runit script with a more robust default. Switching to discourse:discourse and the standard sidekiq.yml, and not forcing -L log/sidekiq.log, makes Sidekiq stable:
#!/bin/bash
exec 2>&1
cd /var/www/discourse

mkdir -p tmp/pids
chown discourse:discourse tmp/pids || true

exec chpst -u discourse:discourse \
  bash -lc 'cd /var/www/discourse && rm -f tmp/pids/sidekiq*.pid; exec bundle exec sidekiq -C config/sidekiq.yml'
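
To apply this in a running container with no rebuild (path is the standard runit location):

# write the script above to /etc/service/sidekiq/run, then:
chmod +x /etc/service/sidekiq/run
sv restart sidekiq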

After this:

  • sv status sidekiq stays at run: instead of down: 1s.
  • AI/background jobs resume.

Request / suggestion

Could we consider making the official Docker/runit Sidekiq service more robust by default?

For example:

  • Run Sidekiq under discourse:discourse (matching typical ownership inside container).
  • Prefer bundle exec sidekiq -C config/sidekiq.yml.
  • Avoid forcing a shared log file via -L log/sidekiq.log, or make it resilient to logrotate/shared-volume perms drift.

Even a doc note (“if Sidekiq shows down: 1s but manual start works, check /etc/service/sidekiq/run and avoid forced shared logging”) would help self-hosters a lot.

Happy to provide more logs if needed. Thanks!


Where are you finding that? Sidekiq is launched via the unicorn master to conserve memory. Not seeing this code at all in discourse_docker. Looks like maybe you are using a very old setup?


Hi — let me restate this strictly based on runtime facts from the official Docker container.

What I’m seeing in the running container (facts)

This is an official Docker install with runit (standard /var/discourse launcher workflow; no rebuild right before the incident). Inside the container:

  1. A runit Sidekiq service exists and is the one being supervised:
ls -l /etc/service/sidekiq/run
sv status sidekiq

Output during the incident:

down: sidekiq: 1s, normally up, want up
  2. Manual Sidekiq start works:
cd /var/www/discourse
sudo -u discourse bundle exec sidekiq -C config/sidekiq.yml

This stays up, connects to Redis, and processes jobs.

  3. Patching only /etc/service/sidekiq/run (no rebuild) fixes the crash loop immediately. I replaced /etc/service/sidekiq/run with:
#!/bin/bash
exec 2>&1
cd /var/www/discourse
mkdir -p tmp/pids
chown discourse:discourse tmp/pids || true
exec chpst -u discourse:discourse \
  bash -lc 'cd /var/www/discourse && rm -f tmp/pids/sidekiq*.pid; exec bundle exec sidekiq -C config/sidekiq.yml'

After that:

sv status sidekiq
run: sidekiq: (pid <PID>) <SECONDS>s
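
As an extra sanity check that jobs are actually draining, the queue depth can be read via the Sidekiq API from a one-off runner; a sketch, assuming sidekiq/api is loadable in this app:

cd /var/www/discourse
# enqueued should trend toward 0 once workers are processing again
sudo -u discourse bundle exec rails runner \
  "require 'sidekiq/api'; puts Sidekiq::Stats.new.enqueued"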

So Sidekiq is not being launched via the Unicorn master in this image; it is a runit service whose runtime script can crash-loop.
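
This is easy to verify independently: the parent pid shows whether Sidekiq was spawned by the Unicorn master or by runit's runsv:

ps -eo pid,ppid,comm,args | grep -E 'sidekiq|unicorn|runsv' | grep -v grep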

Why you may not see the exact code in discourse_docker

I agree the literal text may not be in the repo because /etc/service/sidekiq/run is a runtime artifact generated/injected during image build/boot, not necessarily a verbatim file in discourse_docker. But it is the active supervised service in this official image, as shown above.
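
Comparing the supervised service list against a stock image would settle where this service came from; every directory under /etc/service is an active runit service:

ls -l /etc/service/
head -n 20 /etc/service/sidekiq/run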

What triggered the fragility (facts + minimal inference)

  • We also observed daily logrotate failures due to standard Debian perms: /var/log is root:adm 0775, so logrotate refused rotation until a global su root adm directive was added.
  • When logrotate was failing, it recreated files under /shared/log/rails/, including sidekiq.log.
  • The default runit script in this image used discourse:www-data and forced -L log/sidekiq.log into /shared/log, which makes Sidekiq very sensitive to shared-volume permission drift and can cause an immediate exit before any useful logs are written.

Request / proposal

Given the above, could we consider hardening the default Docker/runit Sidekiq service?

Suggested defaults:

  • run as discourse:discourse (matches typical ownership inside container),
  • start via bundle exec sidekiq -C config/sidekiq.yml,
  • avoid forcing a shared -L log/sidekiq.log (or make it resilient).

This would prevent the silent down: 1s crash loop that stops all background/AI jobs.

Happy to test any branch/commit you point me at.

Again … I am confused about where you are getting your image from:

[screenshot: the official image]

This is the official image.

This is a search for the word "sidekiq" in the official discourse_docker repo:

https://github.com/search?q=repo%3Adiscourse%2Fdiscourse_docker%20sidekiq&type=code

There are 3 hits… nothing about a runit unit. It is managed via unicorn.