This shows the maximum number of user processes/threads allowed.
ulimit -a
This shows all resource limits.
cat /sys/fs/cgroup/pids.max
This checks the maximum number of processes (PIDs) allowed for the container or system cgroup.
Now use logout to return to the host:
systemctl show docker | grep TasksMax
This checks whether systemd has imposed a task/thread limit on the Docker service.
systemctl show containerd | grep TasksMax
This does the same kind of check, but for the containerd service instead of Docker directly.
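If it helps when reading the output of those two checks, here is a minimal sketch that labels each TasksMax line. The function name tasksmax_status is hypothetical, and it assumes systemd reports an unlimited setting as the literal value "infinity"; older systemd versions may print it differently.

```shell
#!/bin/sh
# tasksmax_status (hypothetical helper): read "TasksMax=VALUE" lines on
# stdin and print a short label for each one. Other lines are ignored.
tasksmax_status() {
    while IFS='=' read -r key value; do
        [ "$key" = "TasksMax" ] || continue
        case $value in
            # Assumption: "infinity" is how systemd reports "no limit".
            infinity) echo "no systemd task limit" ;;
            *)        echo "systemd task limit: $value" ;;
        esac
    done
}

# Sample input only. On a real host, pipe in the systemctl output instead:
#   systemctl show docker | tasksmax_status
echo "TasksMax=infinity" | tasksmax_status
echo "TasksMax=4915" | tasksmax_status
```

The value 4915 is only an illustrative number, not something taken from this thread.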
docker inspect app | grep -i pid
This checks the process/PID limits and settings of your Discourse container. The grep -i pid part filters the output to lines containing “pid” (case-insensitive).
If you keep getting errors, could you please paste the output of these commands here? That would be helpful.
Doing a rebuild from the CLI appears to have fixed it. I will keep an eye on it. Something about doing a browser update from beta to stable in the last week triggered this.
Should there be limits on the browser upgrade? Can the browser upgrade detect a potential issue and flag it, or prevent the upgrade from being triggered?
The rebuild likely reset the container’s cgroup placement, which would explain why it’s stable again.
Given the original "can’t alloc thread" errors and the fact that everything else (ulimits, TasksMax, Docker PIDs) is unlimited, the remaining suspect is PID cgroup pressure.
If pids.current is approaching ~2000+ against a max of ~2285, that would confirm the container was hitting the cgroup PID ceiling during the scheduler / Redis reconnect bursts.
That would also explain why the issue only appeared after the upgrade (higher thread churn), and why the rebuild temporarily cleared it.
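To put a number on "approaching the ceiling", here is a rough sketch that turns a pids.current / pids.max pair into a percentage. The function name pid_usage is hypothetical, and it assumes a cgroup v2 layout where an unlimited pids.max is the literal string "max".

```shell
#!/bin/sh
# pid_usage CURRENT MAX (hypothetical helper): print how full the PID
# cgroup is, as an integer percentage rounded down.
pid_usage() {
    current=$1
    max=$2
    # Assumption: cgroup v2 writes "max" when no limit is set.
    if [ "$max" = "max" ]; then
        echo "unlimited"
        return 0
    fi
    echo "$(( current * 100 / max ))%"
}

# Example using the figures discussed above (2000 of 2285).
pid_usage 2000 2285

# Inside the container you would feed it the live values instead:
#   pid_usage "$(cat /sys/fs/cgroup/pids.current)" "$(cat /sys/fs/cgroup/pids.max)"
```

With the figures above this prints 87%, which is the kind of reading that would point at the PID ceiling.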
pids.current: how many processes (PIDs/threads) are currently running inside the container/cgroup.
pids.max: the maximum number of processes (PIDs/threads) allowed in that cgroup (your container).
Thanks - that rules out current PID pressure after the rebuild.
Given pids.current is only 227 against 4194304, the container is definitely not near the cgroup PID ceiling now.
The interesting bit is that the earlier pids.max appeared to be 2285, whereas after the rebuild it is 4194304. The rebuild may have reset the cgroup limit (i.e. recreated the app container), which would explain why there is no evidence of active PID exhaustion at the moment.
If the error comes back, it would be useful to run these while it is failing and paste the output:
cat /sys/fs/cgroup/pids.current
cat /sys/fs/cgroup/pids.max
ps -eLf | wc -l
That would show whether the failure is due to PID pressure or whether we should look elsewhere.
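If it is easier than running the three commands by hand, something like the following could capture them in one go with a timestamp. This is only a sketch: the function name snapshot and the optional directory argument are hypothetical, and it assumes the same cgroup v2 paths as above.

```shell
#!/bin/sh
# snapshot [CGROUP_DIR] (hypothetical helper): print one timestamped
# record with the PID cgroup counters and the thread count.
snapshot() {
    dir=${1:-/sys/fs/cgroup}
    echo "time=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    # Missing files are reported as "n/a" instead of aborting.
    echo "pids.current=$(cat "$dir/pids.current" 2>/dev/null || echo n/a)"
    echo "pids.max=$(cat "$dir/pids.max" 2>/dev/null || echo n/a)"
    # Same command as above; the count includes the ps header line.
    echo "threads=$(ps -eLf 2>/dev/null | wc -l)"
}

# Print one record to the terminal.
snapshot
```

Running it inside the container at the moment of failure and pasting the four lines here would be enough.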