Not to rain on this parade, but from reading the posts there is no solid confirmation yet that the Discourse backup process is causing the problem.
Why not confirm 100% that this problem is caused by the daily backup job? There is normally more than one process being run from the daily crontabs on a host.
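For example, a quick way to see what else runs on a daily schedule on the host (a minimal sketch, assuming a standard Debian/Ubuntu cron layout):
# On the host, not in the container:
ls -la /etc/cron.daily/      # system-wide daily jobs
cat /etc/crontab             # system crontab entries
crontab -l                   # root's per-user crontab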
Did @pnoeric perform a du on the /var/discourse filesystem (outside the container)?
In your notes, @pnoeric writes:
root@x-app:/var/www/discourse# du -h -d 1
But this completely misses the Discourse shared directory, which contains all the backups and uploads! It also misses all the Docker files (and images) on the host, which can grow large if images are not pruned over time.
The place to run this check is outside the container (not in the container!):
For example (outside the container):
cd /var/discourse
/var/discourse# du -sh *
4.0K bin
4.0K cids
56K containers
12K discourse-doctor
24K discourse-setup
164K image
24K launcher
4.0K LICENSE
12K README.md
24K samples
8.0K scripts
62G shared
148K templates
You can see, on this host, the shared dir is 62G.
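From there you can drill down into shared to see whether it is the backups or the uploads that are growing (a sketch; the standalone path assumes a default single-container install):
cd /var/discourse/shared/standalone
du -sh *                          # uploads vs backups vs postgres_data, etc.
ls -lht backups/default/ | head   # newest backups first, with sizes and timestamps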
And here is /var on the host filesystem (again outside the container):
cd /var
# du -sh *
511M cache
20K composetest
62G discourse
1.6G docker
8.0K legacy
52G lib
4.0K local
0 lock
4.0K locks
5.7G log
24K logs
64K mail
4.0K opt
4.0K registry
4.0K shared
1.9M spool
48K tmp
25G linux_app
2.2G www
I’m not trying to rain on this parade, but before going out and proposing a lot of “fixes” to Discourse, it would be very good to be 100% sure that the Discourse backup cron job is actually the problem.
We have had zero problems with the current Discourse backup process, and in addition, managing the filesystem on the host is NOT a Discourse task per se.
Here is df on the same host:
# df
Filesystem 1K-blocks Used Available Use% Mounted on
udev 32892500 0 32892500 0% /dev
tmpfs 6584232 2136 6582096 1% /run
/dev/md2 470927632 215969956 230966124 49% /
tmpfs 32921160 0 32921160 0% /dev/shm
tmpfs 5120 0 5120 0% /run/lock
tmpfs 32921160 0 32921160 0% /sys/fs/cgroup
/dev/md0 482922 75082 382906 17% /boot
/dev/sda1 244988 4636 240353 2% /boot/efi
tmpfs 6584232 0 6584232 0% /run/user/1000
overlay 470927632 215969956 230966124 49% /var/lib/docker/overlay2/0f8be368b0154285423630ad50148ee2d5fdcb357c46125eafa7374ca34ef29a/merged
shm 524288 1620 522668 1% /var/lib/docker/containers/ca7b55fc5a0c123f7b2b1234ea210aa8286a34167cba9344b7929547bd323c9b/mounts/shm
overlay 470927632 215969956 230966124 49% /var/lib/docker/overlay2/7cd7e8b5b35b496eaed68753cc995e9303499a24721062055e2f06beb07e26c8/merged
shm 65536 0 65536 0% /var/lib/docker/containers/3cc0c90c3e3a5db6692e7b5d21727fbb1c13c8e07e48e4f6d954214fc03694a9/mounts/shm
overlay 470927632 215969956 230966124 49% /var/lib/docker/overlay2/31533fdf68033eed96dab4f9df89025ea3dab172ed48b6ce6431840a8df1c8ea/merged
shm 524288 0 524288 0% /var/lib/docker/containers/631fbabedda9a430dd8204ec66fb45c7514d948025124171b960ea424e28d5d4/mounts/shm
overlay 470927632 215969956 230966124 49% /var/lib/docker/overlay2/7a3ba2223ee93bc868b52b3707799d0fd7b4ca6dcc0df29f20c2c98a53903ff1/merged
shm 65536 0 65536 0% /var/lib/docker/containers/7a145366268c8ac5543a4555dc1bfc63c1e85a654e4c793e96fc2cc2e8514388/mounts/shm
overlay 470927632 215969956 230966124 49% /var/lib/docker/overlay2/add4bdd7bd88df7a0e05dff21896d3ef796f7cf2ff9759e0bb04b1953f16cd95/merged
shm 65536 0 65536 0% /var/lib/docker/containers/123743e122089b94660a6bdd2a9e55055ad91b6f75cce4ac760f36066bcf14d0/mounts/shm
overlay 470927632 215969956 230966124 49% /var/lib/docker/overlay2/b376ff32eaac0c58463e8b99b6db9ec0da3405c3f7a9f00b5430f10e07d372b0/merged
shm 524288 0 524288 0% /var/lib/docker/containers/63c52bc571b5f0d2544417da10efc37d3957e7a38f44bc8325145e795ee29559/mounts/shm
Let’s look at the Docker files:
# cd /var/lib
# du -sh docker
30G docker
and our Docker images are regularly pruned and cleaned up.
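If you are not sure whether a given host is being cleaned up, something along these lines shows the breakdown and reclaims unused layers (a sketch; ./launcher cleanup is the route provided by the discourse_docker scripts):
docker system df     # space used by images, containers and volumes
cd /var/discourse
./launcher cleanup   # remove old Discourse containers and images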
@bartv correctly suggested starting here:
I’d start with figuring out which directory is blowing up. My standard approach is to enter /var/discourse and then run du -h -d 1. Take the largest directory, enter it and repeat until you find the suspect. Once you have it, that might give you a clue to what’s going on.
That is a good start, but there are a lot of other places on the host filesystem that can fill it up, including Docker data, core dumps, logs, and so on.
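A few other host-side spots worth a quick look (a sketch, not exhaustive):
du -sh /var/log                                      # runaway logs
journalctl --disk-usage                              # systemd journal size
find / -xdev -name 'core*' -size +100M 2>/dev/null   # stray core dumps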
A graph showing a spike in disk usage once a day is not enough to say, with authority, that the Discourse backup cron process is the root cause. It might be, but it might not be, based on the evidence so far!
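One simple way to get that confirmation is to log disk usage at short intervals and line the timestamps up against the backup schedule (a minimal sketch; the cron file and log path are hypothetical):
# /etc/cron.d/disk-usage-log
# Record root filesystem usage every 5 minutes so the spike can be
# matched against the exact time the backup (or any other job) runs.
*/5 * * * * root echo "$(date -Is) $(df -k / | tail -1)" >> /var/log/disk-usage.log
If the jump always lines up with the backup window, and the backups directory grows by roughly the same amount, then you have your confirmation.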