When backup fails, delete the useless backup

If a site backup fails for lack of disk space while it’s gzipping the backup, the .tar file is left behind. Discourse can’t see or use the .tar file. On one hand, a decent sysadmin would be alarmed enough by a failed backup to immediately go solve the disk space problem and then gzip the backup by hand in a shell. On the other hand, someone who doesn’t like getting their hands dirty in a shell is sort of out of luck.

As an aside, it would seem like 50GB would be a reasonable partition size for a site with a 13GB backup. But there are two copies of the current backup while it’s gzipping, and maximum backups doesn’t delete a backup until there are more than maximum backups, so 50GB is enough for a maximum backups setting of only one. It took me quite a while to understand that math.
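
The math above can be sketched in a few lines of Python. This is a rough model, not anything taken from Discourse itself: the function names are made up for illustration, and it assumes the .tar and the finished .tar.gz are each about the same size as one backup (13GB here), which is plausible when most of the archive is already-compressed uploads.

```python
import math


def peak_disk_usage_gb(backup_gb: float, maximum_backups: int) -> float:
    """Estimate peak usage while a new backup is being gzipped.

    Assumption: the .tar and the .tar.gz are each roughly backup_gb in size.
    """
    stored = maximum_backups * backup_gb  # finished backups not yet pruned
    in_progress = 2 * backup_gb           # the .tar plus the growing .tar.gz
    return stored + in_progress


def max_backups_that_fit(partition_gb: float, backup_gb: float) -> int:
    """Largest maximum_backups value whose estimated peak fits on the partition."""
    return max(0, math.floor(partition_gb / backup_gb) - 2)


if __name__ == "__main__":
    print(peak_disk_usage_gb(13, 1))     # 39: fits in 50GB
    print(peak_disk_usage_gb(13, 2))     # 52: does not fit in 50GB
    print(max_backups_that_fit(50, 13))  # 1
```

Under those assumptions, a setting of two would need about 52GB at the peak, which is why a 50GB partition only has room for one.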

Try a database-only backup, which skips the “combine all the uploaded files into the database archive” step and thus doesn’t need 2x the disk space in the process.

If this isn’t happening, this is a bug. I’m pretty sure we have code to handle cleanup on failure. Which folder is the backup left in? Is it the tmp folder?

It’s in the same folder where finished backups reside, /shared/standalone/backups/default/.

Hmm, that is strange… the entire backup process should take place in a tmp folder before being moved to the backups folder. So if anything blows up, it’ll clean up the tmp folder after. Maybe we’re not catching the error when gzip blows up somehow. I’ll have a :eyes:
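
For readers following along, the pattern described here (do all the work in a tmp folder, move only the finished archive, clean up no matter what) looks roughly like the Python sketch below. It is only an illustration of the idea, not the actual Discourse backup code, which is not shown in this thread; `run_backup` and its parameters are hypothetical names.

```python
import gzip
import shutil
import tarfile
import tempfile
from pathlib import Path


def run_backup(source_dir, backups_dir, name="backup"):
    """Build name.tar.gz in a temp folder, then move it into backups_dir.

    The temp folder is removed whether or not any step fails, so a
    failed gzip cannot leave a stray .tar next to the finished backups.
    """
    tmp_dir = Path(tempfile.mkdtemp(prefix="backup-"))
    try:
        # Step 1: write the uncompressed archive inside the temp folder.
        tar_path = tmp_dir / f"{name}.tar"
        with tarfile.open(str(tar_path), "w") as tar:
            tar.add(str(source_dir), arcname=".")

        # Step 2: gzip it, still inside the temp folder.
        gz_path = tmp_dir / f"{name}.tar.gz"
        with open(tar_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)

        # Step 3: only a finished archive reaches the backups folder.
        backups_dir = Path(backups_dir)
        backups_dir.mkdir(parents=True, exist_ok=True)
        final_path = backups_dir / gz_path.name
        shutil.move(str(gz_path), str(final_path))
        return final_path
    finally:
        # Runs on success and on any exception (for example, disk full).
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

Note that a `finally` block does not help if the whole process is killed or the server crashes mid-backup, so leftover files would still need to be handled some other way in that case.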

Well, the site is up to date.

No. And I thought for a while that perhaps the problem was that it was writing to /tmp and I’ve got a whole separate partition just for backups, so now the site doesn’t crash when the backup fills the disk, but . . . what @falco just said. It could be complicated by having backups somewhere else like this:

 - volume:
      host: /mnt/backups
      guest: /shared/backups

If you’ll point me to the file where the script is (about 30% of the time it’s exactly where I think it’ll be) I’ll check & if I can figure it out (and unless it’s something bizarre, finding the script should be 90% of the problem for me) I’ll submit a PR.

Sure, we want to fix this, but we need a proper repro of the issue with very careful and consistent steps.

I believe this somehow relates to running out of disk space, maybe followed by a server crash, which would force us to add cleanup code either at boot or when you run the next backup.

Bump.

The .tar and the final tar.gz both appear in /var/discourse/shared..../backup while it is running.

The tar.gz is visible in the web interface at /admin/backups while the uploads are being added to it (and its size increments up).

When it runs out of space the tar.gz disappears from /admin/backups, but the .tar file is still there, and the space is not returned (this is 4 hours after the backup failed).
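
In case it helps anyone check their own server, here is a small Python sketch that lists leftover .tar files in a backups folder. It only reports and never deletes anything. The function name and the one-hour age threshold are my own assumptions, and it assumes a leftover file ends in exactly `.tar`; pass it whatever backups directory applies to your install.

```python
import time
from pathlib import Path


def find_stale_tars(backups_dir, min_age_seconds=3600, now=None):
    """Return .tar files in backups_dir last modified at least min_age_seconds ago.

    Finished .tar.gz backups are not matched by the "*.tar" pattern.
    The age check is a guess at "no backup is still writing this file".
    """
    now = time.time() if now is None else now
    stale = []
    for path in sorted(Path(backups_dir).glob("*.tar")):
        age = now - path.stat().st_mtime
        if age >= min_age_seconds:
            stale.append(path)
    return stale


if __name__ == "__main__":
    import sys

    for leftover in find_stale_tars(sys.argv[1]):
        size_gb = leftover.stat().st_size / 1e9
        print(f"{leftover} ({size_gb:.1f} GB)")
```

Run it with the backups directory as the only argument, and double-check that no backup is in progress before removing anything it lists.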
