Sidekiq runit script too fragile

Hi team,

I'm reporting a failure mode in the official Docker/runit setup that can silently kill Sidekiq (and with it AI and other background jobs) without any rebuild or upgrade.

Environment

  • Official Discourse Docker install (standard container + runit services).
  • No rebuild/upgrade right before the issue started.
  • Discourse AI plugin enabled, but AI stopped replying.

Symptoms

  • AI looks enabled in admin UI, but no AI replies appear.
  • Background jobs (AI/embeddings/auto-reply) appear stuck.
  • sv status sidekiq shows Sidekiq repeatedly dying right after start (inspection commands are sketched after this list):
down: sidekiq: 1s, normally up, want up
  • Manually starting Sidekiq works fine, so the app itself is OK:
bundle exec sidekiq -C config/sidekiq.yml
# stays up, connects to Redis, processes jobs
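
For reference, this is roughly how the state was inspected from inside the container; /var/discourse is the usual install directory for the official setup, so adjust if yours lives elsewhere:

# enter the running container (standard install path assumed) and inspect the service
cd /var/discourse && ./launcher enter app
sv status sidekiq               # keeps reporting "down: sidekiq: 1s, normally up, want up"
cat /etc/service/sidekiq/run    # the stock script quoted in the next section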

What we found

The default runit script was:

exec chpst -u discourse:www-data \
  bash -lc 'cd /var/www/discourse && ... bundle exec sidekiq -e production -L log/sidekiq.log'
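
For context on the chpst line above, -u discourse:www-data runs the process as user discourse with primary group www-data. A quick way to confirm the resulting uid/gid inside the container:

# effective uid/gid with the stock script's account spec
chpst -u discourse:www-data id
# compare with the ownership the fix below switches to
chpst -u discourse:discourse id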

Two fragility points:

  1. Primary group www-data. In my container, the typical writable paths are owned by discourse:discourse. Any drift in tmp/pids or the shared paths can make Sidekiq exit during boot when it runs under www-data, even though a manual start as discourse works.
  2. Forced -L log/sidekiq.log writing to shared logs. The log path is a symlink into /shared/log/rails/sidekiq.log. If that file or directory gets recreated with different ownership/permissions, Sidekiq can exit immediately, before producing any useful logs (a quick ownership check is sketched after this list).
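
A quick way to check for the ownership drift described above (paths as used in this setup; adjust if yours differ):

# check who owns the pid dir and the (symlinked) Sidekiq log
ls -ld /var/www/discourse/tmp/pids
ls -lL /var/www/discourse/log/sidekiq.log   # -L follows the symlink into /shared/log/rails/
ls -ld /shared/log/rails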

Related trigger: logrotate failing daily

Separately, logrotate was failing every day with:

error: skipping "...log" because parent directory has insecure permissions
Set "su" directive in config file ...

The cause is standard Debian/Ubuntu permissions:

  • /var/log is root:adm with 0775 (group writable).
  • logrotate refuses to rotate unless a global su directive is set; this is expected upstream behavior (a quick way to confirm both points is shown below).
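
Both points are easy to confirm; logrotate's debug mode reproduces the complaint without rotating anything (assuming the default /etc/logrotate.conf path):

ls -ld /var/log                    # in this setup: drwxrwxr-x root adm (group-writable parent)
logrotate -d /etc/logrotate.conf   # debug/dry run; reprints the "insecure permissions" skips without rotating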

Each time the daily logrotate job failed, it also recreated files under /shared/log/rails/ (including sidekiq.log), which likely interacted with the forced -L logging and contributed to Sidekiq's “1s crash” loop.

Fix (no rebuild needed)

  1. Fix logrotate so it stops touching shared logs while in a failed state. Add a global su directive:
# /etc/logrotate.conf (top)
su root adm

After that, logrotate -v exits 0 and no longer reports insecure parent permissions.

  2. Replace the Sidekiq runit script with a more robust default. Switching to discourse:discourse and the standard config/sidekiq.yml, and not forcing -L log/sidekiq.log, makes Sidekiq stable:
#!/bin/bash
exec 2>&1
cd /var/www/discourse

mkdir -p tmp/pids
chown discourse:discourse tmp/pids || true

exec chpst -u discourse:discourse \
  bash -lc 'cd /var/www/discourse && rm -f tmp/pids/sidekiq*.pid; exec bundle exec sidekiq -C config/sidekiq.yml'
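
To apply this without a rebuild, something like the following works from inside the container (standard /etc/service/sidekiq/run path assumed). Note the edit lives only inside the container, so a rebuild restores the stock script unless the change is also made persistent via the container template:

# after replacing /etc/service/sidekiq/run with the script above:
chmod +x /etc/service/sidekiq/run
sv restart sidekiq
sv status sidekiq    # should now report "run: sidekiq: (pid ...) ..." and stay there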

After this:

  • sv status sidekiq stays in run:
  • AI/background jobs resume.

Request / suggestion

Could we consider making the official Docker/runit Sidekiq service more robust by default?

For example:

  • Run Sidekiq under discourse:discourse (matching typical ownership inside the container).
  • Prefer bundle exec sidekiq -C config/sidekiq.yml.
  • Avoid forcing a shared log file via -L log/sidekiq.log, or make it resilient to logrotate/shared-volume perms drift.

Even a doc note (“if Sidekiq shows down: 1s but manual start works, check /etc/service/sidekiq/run and avoid forced shared logging”) would help self-hosters a lot.

Happy to provide more logs if needed. Thanks!