Disk usage spike during backup, Discourse crashed hard :-(

This morning around 5:35am, my forums suddenly spiked on disk usage and crashed, going completely offline. I had to resize the Digital Ocean droplet to get them back on their feet. Oof.

Here’s my disk usage over the last 24 hours:

Question: what kind of logs/post-mortem analysis can I look at to try to figure out what the heck happened?! I checked the logs in the Discourse control panel but there are no clues there… they just end when the site crashed and pick right back up when it came back online.

1 Like

I’d start with figuring out which directory is blowing up. My standard approach is to enter /var/discourse and then run du -h -d 1. Take the largest directory, enter it and repeat until you find the suspect. Once you have it, that might give you a clue to what’s going on.
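For example, something like this on the host (“shared” below is just an example of which directory might turn out to be the biggest):

cd /var/discourse
du -h -d 1     # find the largest top-level directory
cd shared      # step into the biggest one...
du -h -d 1     # ...and repeat until the culprit stands out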

3 Likes

Maybe an auto-backup?

3 Likes

Yeah, backups are a common source of this kind of failure - what’s the disk usage look like over a 7 day window?

Also note that local uploads are included in these backups, so if you had a significant increase in uploads around 18:00, that would also increase the backup archive size.

5 Likes

Hmm. I have been transitioning files off S3 and back to my local server, but the process seems to run on the fly, and only a few hundred images (all around ~300 KB each) at a time, which works out to ~0.1 GB per batch. Over the last week, I might have run the script 20 times, so 20 batches is around 2 GB of disk space total. Which I had plenty of room for.

Is there any chance that even though the script appears to move them on the fly (downloading them from S3 and apparently uploading them immediately to Digital Ocean), there could also be some kind of lag for a queued job that would have kicked in at 5:30am, related to moving those images?

(Also: I was running these batches manually until 9pm, so as far as I know, the server was just doing normal operations from 9pm until 5:30am when it went down.)

Here’s my 7-day disk usage. It was climbing steadily from the images being imported, but you can see where it slammed up to 100% at 5:30am:

Are there any log files that might have some clues about what happened at 5:35am, besides the log files I see in the ‘Logs’ tab?
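(I assume the raw logs on the host are the next place to dig; on a standard Docker install I believe they end up under the shared volume, e.g.:)

tail -n 200 /var/discourse/shared/standalone/log/rails/production.log
ls /var/discourse/shared/standalone/log/var-log/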

1 Like

Hmmm. My backups are set to go to S3, every 2 days, but nothing since the 9th?

Discourse ‘Backups’ view

Amazon S3 view

Incidentally, after seeing the above, I clicked the button in Discourse to trigger a backup. It took 28 minutes and seemed to work fine; I now see that .tar.gz file in both the Discourse and Amazon S3 views of my backups. So why would my auto backups not be triggering?! Arggggh.
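(One thing I can at least check is whether the failed attempts left anything behind in the local backups directory on the host; this path assumes the standard install:)

ls -lah /var/discourse/shared/standalone/backups/default/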

I’m mystified… none of these are particularly large:

root@x-app:/var/www/discourse# du -h -d 1

3.5M	./lib
104K	./bin
8.0K	./.tx
148M	./public
8.0K	./.bundle
14M	./plugins
4.3M	./db
4.0K	./log
532M	./tmp
8.9M	./spec
17M	./config
556M	./vendor
8.0K	./images
329M	./.git
2.0M	./script
80K	./docs
2.5M	./test
16K	./.github
17M	./app
1.6G	.

And even looking at overall disk usage from inside the Docker container, it’s not as big as it was. I had an 80 GB Digital Ocean droplet; that’s what hit 100%. So then I resized it to 160 GB, doubling it. In theory, that means one of these should be at around 50%, correct?

root@x-app:/var/www/discourse# df -h

Filesystem      Size  Used Avail Use% Mounted on
overlay         155G   58G   98G  38% /
tmpfs            64M     0   64M   0% /dev
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
shm             512M  2.6M  510M   1% /dev/shm
/dev/vda1       155G   58G   98G  38% /shared
tmpfs           3.9G     0  3.9G   0% /proc/acpi
tmpfs           3.9G     0  3.9G   0% /proc/scsi
tmpfs           3.9G     0  3.9G   0% /sys/firmware

You have spikes to almost 100% every night before — looks like this one tipped you over the edge. I expect the previous backups failed due to running out of space while creating the local backup file to send to S3, but they merely failed and didn’t break your forum. You finally noticed when the out-of-space condition made postgresql unhappy (or redis, or whatever, it’s not really important) at just the right time to bring your forum down.

(With nearly 100GB of images on my server, I do Discourse scheduled backups without uploads, but with thumbnails. Then I do an offsite file-based backup of the backups directory first and the uploads directory second. I have tested this for recovery; it was the basis of a site migration I did last year. Storing 100GB tarballs every night would be crazy.)
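(Concretely, the file-based part amounts to something like the following; the destination host and paths are just placeholders for illustration:)

# sync the backups directory first, then the uploads directory (host and destination paths are placeholders)
rsync -av /var/discourse/shared/standalone/backups/ backuphost:/srv/backup/discourse/backups/
rsync -av /var/discourse/shared/standalone/uploads/ backuphost:/srv/backup/discourse/uploads/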

7 Likes

Aha, so those little spikes are Discourse trying to make a backup! That sheds some light on things.

So here is my trailing 7-day chart again.

Maybe what we’re seeing is:

  1. Several times over this past week, Discourse attempted to make a backup. This process temporarily eats a lot of disk space, and each time it tried, it ran out of space, so none of those backups actually worked.

  2. Then when it tried yet again to make a backup last night, it got farther along, but unfortunately crashed the site.

This makes some sense since the last successful backup was July 9. So it waited 2 days (per my settings) and tried again July 11. That failed, so it waited 24 hours and tried again on the 12th and the 13th, with the fatal retry on the 14th.

If that’s what happened, I’d love to see:

  1. Better notification from Discourse when a backup fails

  2. Perhaps Discourse should automatically “fail” a backup (creating a notification) if, when it starts, there is less than x% (10%?) free disk space. So it doesn’t even start if disk space is already tight. (A crude do-it-yourself version of this is sketched right after this list.)
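In the meantime, I suppose a rough host-side version of #2 could just live in cron; the threshold and mail address here are placeholders:

#!/bin/bash
# Warn when the root filesystem crosses a chosen threshold (90% here is arbitrary).
THRESHOLD=90
USED=$(df --output=pcent / | tail -1 | tr -dc '0-9')
if [ "$USED" -ge "$THRESHOLD" ]; then
  echo "Disk is ${USED}% full on $(hostname)" | mail -s "Low disk space warning" admin@example.com
fi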

BTW, if this really is what happened, then the first failed backup, on July 11, started with ~40% free disk space (which would have been ~32 GB!!!), and even that wasn’t enough for the backup to complete successfully. Is that right?! Why would Discourse need what seems like an inordinate amount of scratch/working space when producing a backup?

2 Likes

It didn’t necessarily get further along last night; you just “lost a race” — what happens when you run out of space depends on which component is affected by the problem first.

If it fails to make a backup, I rather expect it might try to send a message, but if it’s out of disk space, it might not succeed. :scream:

A fixed percentage doesn’t really tell you much; the database might be tiny compared to uploads, or vice versa, and there are variables of whether thumbnails are included and whether uploads are included. I could see a configurable free space requirement so you could tune for your site, I guess.

I don’t know how you are judging “inordinate” — it doesn’t strike me as inordinate.

2 Likes

Fair enough; as you point out, there are many variables in play.

Oh, the “right” way to do it would be to compute the amount of space it thinks it’ll need for the backup, etc, etc. But to keep it super simple, yeah, just a flat %. I’m just thinking… if the two choices are “your site might crash completely and go offline” or “here is a not-perfect-but-quickie fix for the problem,” I’ll take the latter, thanks. :wink:

And speaking of thanks, thank YOU for all your help with the migration stuff and with your thoughts on this. :+1:t2:

1 Like

Estimating space required for backup is one of the hard problems in computer science… It’s a distant relative of the progress bar. :wink:

In all seriousness, part of it is a database dump, and I don’t know how you would estimate that ahead of time. If you have enough images that space becomes an issue, including them in backup archives is probably outside the mainstream practice.
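(For a very rough sense of scale you can at least ask Postgres how big the database is on disk; the dump is a different format and gets compressed, so treat this as a loose guide. Run inside the app container, assuming the default database name:)

su postgres
psql discourse -c "SELECT pg_size_pretty(pg_database_size('discourse'));"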

Typically, when it comes to system administration, free space monitoring and backup health have been an administrative burden rather than an application burden. This is part of what folks are paying for when they pay CDCK to host their Discourse.

There are plenty of other ways to run out of space. I know you are focused on the one that bit you, but the problem is more general, and I think that this is more normally addressed as administrative overhead.

4 Likes

Not to rain on this parade, but actually, from reading the posts, there is no solid confirmation that the Discourse backup process is causing the problem.

Why not confirm 100% that this problem is caused by a daily backup process? There is usually more than one process running from daily crontabs on a host.

Did @pnoeric perform a du on the /var/discourse filesystem (outside the container)?

In your notes, @pnoeric writes:

root@x-app:/var/www/discourse# du -h -d 1

But this completely missed the Discourse shared directory, which includes all the backups and uploads! It also misses all the Docker files (and images) on the host, which can grow large if images are not pruned over time.

The place to run this check is outside the container (not in the container!):

For example (outside the container):

cd /var/discourse 
/var/discourse# du -sh *
4.0K	bin
4.0K	cids
56K	containers
12K	discourse-doctor
24K	discourse-setup
164K	image
24K	launcher
4.0K	LICENSE
12K	README.md
24K	samples
8.0K	scripts
62G	shared
148K	templates

You can see, on this host, the shared dir is 62G.
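You can drill one level further (still outside the container; “standalone” is the default data directory name for a standard install):

cd /var/discourse/shared/standalone
du -sh *    # typically shows backups, log, postgres_data, redis_data, uploads, ...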

and also from /var of the filesystem (outside the container)

cd /var
# du -sh *
511M	cache
20K	composetest
62G	discourse
1.6G	docker
8.0K	legacy
52G	lib
4.0K	local
0	lock
4.0K	locks
5.7G	log
24K	logs
64K	mail
4.0K	opt
4.0K	registry
4.0K	shared
1.9M	spool
48K	tmp
25G	linux_app
2.2G	www

I’m not trying to rain on this parade, but before going out and proposing a lot of “fixes” to Discourse, it would be very good to confirm, 100%, that the Discourse backup cron is the actual problem.

We have had zero problems with the current Discourse backup process, and in addition, managing the filesystem on the host is NOT a Discourse task per se.

Here:

# df

Filesystem     1K-blocks      Used Available Use% Mounted on
udev            32892500         0  32892500   0% /dev
tmpfs            6584232      2136   6582096   1% /run
/dev/md2       470927632 215969956 230966124  49% /
tmpfs           32921160         0  32921160   0% /dev/shm
tmpfs               5120         0      5120   0% /run/lock
tmpfs           32921160         0  32921160   0% /sys/fs/cgroup
/dev/md0          482922     75082    382906  17% /boot
/dev/sda1         244988      4636    240353   2% /boot/efi
tmpfs            6584232         0   6584232   0% /run/user/1000
overlay        470927632 215969956 230966124  49% /var/lib/docker/overlay2/0f8be368b0154285423630ad50148ee2d5fdcb357c46125eafa7374ca34ef29a/merged
shm               524288      1620    522668   1% /var/lib/docker/containers/ca7b55fc5a0c123f7b2b1234ea210aa8286a34167cba9344b7929547bd323c9b/mounts/shm
overlay        470927632 215969956 230966124  49% /var/lib/docker/overlay2/7cd7e8b5b35b496eaed68753cc995e9303499a24721062055e2f06beb07e26c8/merged
shm                65536         0     65536   0% /var/lib/docker/containers/3cc0c90c3e3a5db6692e7b5d21727fbb1c13c8e07e48e4f6d954214fc03694a9/mounts/shm
overlay        470927632 215969956 230966124  49% /var/lib/docker/overlay2/31533fdf68033eed96dab4f9df89025ea3dab172ed48b6ce6431840a8df1c8ea/merged
shm               524288         0    524288   0% /var/lib/docker/containers/631fbabedda9a430dd8204ec66fb45c7514d948025124171b960ea424e28d5d4/mounts/shm
overlay        470927632 215969956 230966124  49% /var/lib/docker/overlay2/7a3ba2223ee93bc868b52b3707799d0fd7b4ca6dcc0df29f20c2c98a53903ff1/merged
shm                65536         0     65536   0% /var/lib/docker/containers/7a145366268c8ac5543a4555dc1bfc63c1e85a654e4c793e96fc2cc2e8514388/mounts/shm
overlay        470927632 215969956 230966124  49% /var/lib/docker/overlay2/add4bdd7bd88df7a0e05dff21896d3ef796f7cf2ff9759e0bb04b1953f16cd95/merged
shm                65536         0     65536   0% /var/lib/docker/containers/123743e122089b94660a6bdd2a9e55055ad91b6f75cce4ac760f36066bcf14d0/mounts/shm
overlay        470927632 215969956 230966124  49% /var/lib/docker/overlay2/b376ff32eaac0c58463e8b99b6db9ec0da3405c3f7a9f00b5430f10e07d372b0/merged
shm               524288         0    524288   0% /var/lib/docker/containers/63c52bc571b5f0d2544417da10efc37d3957e7a38f44bc8325145e795ee29559/mounts/shm

Let’s look at the Docker files:

# cd /var/lib
# du -sh docker
30G	docker

and our Docker images are regularly pruned and cleaned up.
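(For reference, the housekeeping I mean is just the usual Docker commands:)

docker system df        # show space used by images, containers and local volumes
docker image prune -a   # remove images not used by any container (make sure nothing you need is “unused”)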

@bartv correctly suggested starting here:

I’d start with figuring out which directory is blowing up. My standard approach is to enter /var/discourse and then run du -h -d 1 . Take the largest directory, enter it and repeat until you find the suspect. Once you have it, that might give you a clue to what’s going on.

That is a good start, but there can be a lot of other places on the host file system which can fill up the filesystem including Docker, core files, etc.

A graph showing a spike in percentage once a day is not enough to say, with authority, that the Discourse backup cron process is the root cause. It might be, but it might not be, based on the evidence so far!

6 Likes

This is great. I’ll try all the stuff you mentioned. Thank you.

1 Like

Yep, that’s obviously a backup.

Nah, there’s plenty of confirmation: spikes are on a 2 day interval with one exception, and the backup frequency is set to 2 days. Past experience on Meta has shown exactly this failure mode, too.

Yup, this is a solid plan for moving forwards. The first recommendation for people who start to hit the disk space limit on their VPS is off-machine upload storage with the S3 mechanism, though.

8 Likes

Since @pnoeric is trying to move off S3 for images, storing multiple copies of all the images in a backup that’s in S3 wouldn’t accomplish the purpose of moving off S3. @pnoeric this does confuse me — if you want to move off S3 but only move a fraction of the files off because you store all the images on S3 in multiple copies of backups, what is the point?

In any case, I was trying to show what alternatives are like. Backup is hard, especially if you ever want to be able to restore from the backup.

I moved off “S3” (Digital Ocean Spaces in my case) once I had enough server space, and without tremendous growth or traffic there wasn’t much sense in staying on “S3”. But I’m unusual, which is probably why I never got a word of review on my PR that resolves data corruption on migration off S3. :stuck_out_tongue: So I expect my backup regime to be highly unusual.

4 Likes

My situation is that I have a lot of images, which means a lot of transfer bandwidth as people view those images… so when the images were living on Amazon S3, the bandwidth bill was really what killed me. Especially when I realized I could store all the images on the DO droplet and they would be included in the bandwidth/storage fees I already pay. (At some point in the future, it might make sense to move things back to S3, or it might make more sense to just increase my DO droplet again then…)

So I started with S3, then realized my error. Thus my current situation, using your excellent code to migrate all the images from S3 back to DO.

Keeping a full backup (images and all) on S3 is a totally different story-- it’s in “cold storage” on S3 and not accessed unless there’s a problem. So no big bandwidth bills.

Also: I was thinking more about the backup/disk use situation. I still maintain there’s something missing here. Maybe it’s just a warning message or better documentation. But my Discourse was using just 60% of the disk, and my off-site backups were failing. Some kind of estimate of the disk space needed, or a warning when there isn’t enough, or something, seems like it would be better than what happens now when there’s not enough room: no backups for several days, followed by a hard crash that took the forums completely offline. :-\

(@riking even said “backups are a common source of this kind of failure.” So Discourse instances are regularly crashing because backups are failing without warning of a potential problem?)

Another way to say it, very simply, at the 30,000-foot view: it seems like a design flaw if a basic feature of the software (auto backups) can take the whole thing offline. Especially when we’re talking about a feature that just uses disk space to prepare the backup, not even store it on the same disk.

1 Like

No, he meant taking backups via any software on any server can potentially fill the disk and cause issues.

3 Likes

Sure, but that’s why you front S3 with a CDN. Don’t serve images directly from S3; that’s going to be ridiculously expensive :scream:. You can front S3 with CloudFront or even Cloudflare quite easily. The free tier of Cloudflare will achieve this.

And storing them locally is also pretty bad news; you’re going to need to scale up your VPS unnecessarily. Local SSD will be much costlier.
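(If you do go back to S3 with a CDN in front, the usual pattern, with placeholder values below, is to set the S3 and CDN settings in containers/app.yml and rebuild; you’ll also need the access key/secret variables:)

# containers/app.yml, env: section (values below are placeholders):
#   DISCOURSE_USE_S3: true
#   DISCOURSE_S3_BUCKET: my-uploads-bucket
#   DISCOURSE_S3_REGION: us-east-1
#   DISCOURSE_S3_CDN_URL: https://cdn.example.com
cd /var/discourse && ./launcher rebuild app   # apply the change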

7 Likes

Ah, ok, got it.

So how can I tell how much disk space might be needed for Discourse to prepare a backup? The software isn’t telling me, so maybe tomorrow it’ll be 500 GB and it will take my Digital Ocean server offline again. :man_shrugging:t2: At least if I can do some back-of-the-envelope math, I can try to stay on top of it.
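Something like this is what I have in mind, assuming the scratch space is roughly the database plus whatever uploads get archived, before compression (paths assume a standard install):

du -sh /var/discourse/shared/standalone/postgres_data      # database on disk (the dump will differ in size)
du -sh /var/discourse/shared/standalone/uploads            # only relevant if uploads are included in backups
du -sh /var/discourse/shared/standalone/backups/default    # archives already being kept locally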

Oh wow, that’s a great idea. Never thought of that. So I would apply the CDN to my Amazon bucket, and then tell Discourse to use S3 for all assets? (Like I had before? lol)

3 Likes