Add option to disable backup compression


#5

@mpalmer The --rsyncable option helps a lot, and should probably just be turned on for everyone.

The cost of --rsyncable

7 daily backups created by discourse:

  • Total with current gzip options (none): 306294397 bytes
  • Total after decompression and recompression with gzip --rsyncable: 307156949 bytes

~0.3% is sufficiently negligible. :slightly_smiling:
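
For anyone wanting to check the cost on their own data, the recompression test was along these lines (a minimal sketch, assuming the standard standalone backup path - adjust to wherever your backups live):

cd /var/discourse/shared/standalone/backups/default
mkdir -p rsyncable
for f in *.tar.gz; do
  # decompress, then recompress with --rsyncable; the originals are left untouched
  gzip -cd "$f" | gzip --rsyncable > "rsyncable/$f"
done
du -cb *.tar.gz | tail -n 1             # total with current gzip options
du -cb rsyncable/*.tar.gz | tail -n 1   # total after recompression with --rsyncable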

The benefit of --rsyncable

For those same 7 tarballs, with default gzip, both borg and tarsnap are unable to deduplicate the data. (Savings is <1%)

And with gzip --rsyncable:

  • borg is able to deduplicate the 7 tarballs to 1.187x
  • tarsnap is able to deduplicate the 7 tarballs to 1.165x

(x = average size of single tarball)
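
If you want to reproduce the dedup measurement, a minimal borg sketch looks something like this (the repository path is just a placeholder; borg’s own stats report the deduplicated totals):

borg init --encryption=none /backups/discourse-borg
for f in /var/discourse/shared/standalone/backups/default/*.tar.gz; do
  # one archive per daily tarball; --stats prints the deduplicated size after each one
  borg create --stats --compression none "/backups/discourse-borg::$(basename "$f")" "$f"
done
# repository totals: original size vs deduplicated size across all archives
borg info /backups/discourse-borg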

The case for having Discourse create uncompressed backups

TLDR: Not a huge win.

Uncompressing the tarballs increases their size on disk by ~51%. I have to assume this would vary for other installations based on what percentage of the tarball is sql/text vs already-compressed uploads. (Uploads comprise ~60% of my tarballs, measured with du when extracted.)

Both borg and tarsnap deduplicate a bit better with an uncompressed tarball than with --rsyncable:

  • With borg set to do no compression of its own, the 7 uncompressed tarballs deduplicate to only 77.25 MB.
  • With borg set to do chunk-wise zlib compression, that shrinks to 47.29 MB.
  • Tarsnap always does chunk-wise compression, totaling 46.43 MB.

For backups of all of /var/discourse (including the 7 tarballs, but also non-tarred copies of the uploads, and postgres’ files on disk, etc…):

  • /var/discourse with tarballs compressed with gzip defaults, tarsnap compressed total: 371428106 bytes
  • /var/discourse with tarballs compressed with gzip --rsyncable, tarsnap compressed total: 114474052 bytes
  • /var/discourse with uncompressed tarballs, tarsnap compressed total: 96140976 bytes

Because the savings between the last two is less than half the size of my uploads (as extracted from a tarball), I’m assuming that deduplication of uploads between on-disk and in-tarball is not happening, which is the main thing I hoped to gain by having uncompressed tarballs.

So, forget about the uncompressed option, at least for now, but please do turn on --rsyncable!


(Sam Saffron) #6

Sure, do a PR for this, excellent research!!! :microscope:


#7

Sorry—if it’s going to happen the commit can’t come from me. CLA is a no-go. :pensive:


(Sam Saffron) #8

np, @zogstrip can you add this?


(Dean Taylor) #9

I completed a quick test to see the possible impact of this change for the largest Discourse instance I have to hand.

root@forum:/var/discourse/shared/standalone/backups/default# time -p sh -c "gzip -cd example-com-2016-01-11-033541.tar.gz | gzip > example-com-2016-01-11-033541-recompressed-nochange.tar.gz"
real 671.66
user 637.36
sys 60.42
root@forum:/var/discourse/shared/standalone/backups/default# time -p sh -c "gzip -cd example-com-2016-01-11-033541.tar.gz | gzip --rsyncable > example-com-2016-01-11-033541-recompressed-resyncable.tar.gz"
real 748.85
user 716.44
sys 61.82
root@forum:/var/discourse/shared/standalone/backups/default# ls -l
-rw-r--r-- 1 root root     8712395243 Jan 11 00:01 example-com-2016-01-11-033541-recompressed-nochange.tar.gz
-rw-r--r-- 1 root root     8726238790 Jan 11 00:14 example-com-2016-01-11-033541-recompressed-resyncable.tar.gz
-rw-r--r-- 1 1000 www-data 8712395272 Jan 10 22:41 example-com-2016-01-11-033541.tar.gz

By adding --rsyncable it took an additional 77.19 seconds to compress the file and increased output file size by 13,843,547 bytes or approximately 13.84 MB for no real gain for me.

Perhaps not worrisome as this is a background process - just thought it was worth testing.


(Sam Saffron) #10

Good point then, this needs to be an opt-in option via a site setting.


(Matt Palmer) #11

Thanks @Koinu for doing the initial testing, and @DeanMarkTaylor for the extra data.

It’s important to note that it’s 13.84MB on an 8,712.40MB archive – an increase of 0.15%. That really is down in the noise, IMO. The increase in time is more worrying: it’s somewhere in the vicinity of a 10% increase, which, while it’s a background process, is still a fair chunk more time. I’m honestly quite surprised at the increase in time taken, I didn’t think it had that much impact. Some hand-wavey tests on a local file here seem to support the time increase you saw.

I’m not a fan of adding a site setting for something like this. People will not turn it on when they should, or turn it on when they shouldn’t… I don’t think it’ll be a good outcome. Despite the increase in time, I’m inclined to turn it on, because it does have some significant benefits for anyone doing dedup (or any other delta-observant process, like, say, rsync) and shouldn’t have any observable impact on other people.


#12

The concern about time just reminded me that pigz exists. It even has --rsyncable!

(pigz produces output that’s compatible with gzip, using multiple cores to do the compression and resulting in much shorter wall clock time vs ordinary gzip.)

It’d probably be a good idea to run it niced into oblivion, since its many threads could have a much bigger impact on a running instance’s performance than gzip’s one thread.

Might be worth checking out, especially for that large instance. :slight_smile:

Edit:

I did some quick tests on my not-so-huge backup.

At the default compression level, pigz --rsyncable output is very slightly bigger than gzip --rsyncable output. It deduplicates just as well.

For my data, pigz --rsyncable costs ~0.736% in size versus plain gzip. (For comparison, gzip --rsyncable costs ~0.281%.)

The wall clock time improvement is impressively linear. With 4 real cores on my laptop, pigz --rsyncable takes ~0.568 seconds (wall clock time), compared to ~2.222 seconds for plain gzip.

There may be a sweet spot where you can “have your cake and eat it too” with shorter wall clock time and smaller but dedup-friendly output, by combining a switch to pigz (+ --rsyncable) with a bump in the compression level.
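
Untested beyond my own small backup, but the test I have in mind mirrors the earlier recompression test, with the process niced/ioniced and the level bumped (the file name is a placeholder, and level 9 is just an example, not a measured sweet spot):

time -p sh -c "gzip -cd example-com-backup.tar.gz | nice -n 19 ionice -c 3 pigz --rsyncable -9 > example-com-backup-pigz9.tar.gz"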


(Matt Palmer) #13

Whilst using (niced-to-oblivion) pigz will give you a shorter compression time, I think it’s somewhat orthogonal to the decision to enable --rsyncable (modulo compressed size differences), because any benefit you get from using pigz is equally applicable to both --rsyncable and normal compression. Trading off more CPU time for better compression is an interesting idea, since you’ll still (presumably) get wall clock benefits. Still, though, since it’s a background process, conserving wall clock time isn’t as big a deal as it would be otherwise. I wonder if using gzip for background jobs, but pigz for interactive backups, would be a worthwhile complication…


(Dean Taylor) #14

There is more likely to be some wiggle-room in resources / wall-clock time from piping the output of the currently separate commands.

Currently an 8GB+ tar file is written to disk, then read back from disk and gzipped, then the 8GB gzipped file is written to disk…
… it seems to me that simply piping the tar output straight into gzip would save resources and time.

(I haven’t tested this)
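
Untested, but I mean something along these lines, so that the intermediate tar never touches the disk (file and directory names here are only illustrative):

tar --create --dereference meta.json dump.sql uploads | gzip > backup.tar.gz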


(Matt Palmer) #15

Yerch… that does seem like something that should be fixed. It would definitely save a bunch of time (and disk I/O) to not do that. Feel like whipping up a PR?


(Dean Taylor) #16

I could write one - but I wouldn’t feel good about submitting it without decent testing - and I don’t have a dev environment setup.


(Kane York) #17

The #1 problem making this not a simple rewrite is that the backup script uses tar --append to add files over 3 commands, rather than collecting all the filenames it needs and tarring them all up at once.
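
Roughly the difference in shape (file names are illustrative, not the script’s actual paths):

# current shape: build the archive incrementally, then compress it as a separate step
tar --create --file backup.tar meta.json
tar --append --file backup.tar dump.sql
tar --append --file backup.tar uploads
gzip -5 backup.tar

# pipe-friendly shape: collect everything up front, tar once, stream straight into gzip
tar --create meta.json dump.sql uploads | gzip -5 > backup.tar.gz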


(Sam Saffron) #18

If you have Linux and are running uid 1000, try out the docker wrappers I wrote (discourse/bin/docker at master · discourse/discourse · GitHub), they make it ultra trivial to get started …

./boot_dev
./rails s

done you have a dev env.


(Dean Taylor) #19

Ruby is not my forte (i.e. I don’t write Ruby) - but here were my secondary thoughts when I saw pax isn’t installed by default in Ubuntu:

# build tar --transform rules so the archive stores basenames rather than full paths
metadata_params = "--transform='s,#{Regexp.escape(@meta_filename).shellescape},#{Regexp.escape(File.basename(@meta_filename)).shellescape},g' #{@meta_filename.shellescape}"
dump_params = "--transform='s,#{Regexp.escape(@dump_filename).shellescape},#{Regexp.escape(File.basename(@dump_filename)).shellescape},g' #{@dump_filename.shellescape}"
uploads_params = ""
if @with_uploads
  upload_directory = "uploads/" + @current_db
  upload_full_path = File.join(Rails.root, "public/" + upload_directory)
  uploads_params = "--transform='s,#{Regexp.escape(upload_full_path).shellescape},#{Regexp.escape(File.basename(upload_directory)).shellescape},g' #{upload_full_path.shellescape}"
end
shell_cmd = "tar --create --dereference #{metadata_params} #{dump_params} #{uploads_params} --verbose --show-transformed-names"

# execute our command (or log it, or something); tar_filename is assumed to be defined elsewhere
`#{shell_cmd} | gzip -5 > #{tar_filename}.gz`

Notes

  • Full pathnames are used on purpose to avoid any possible collisions
  • The only other thing I would consider replacing / escaping is the use of “commas” (,) in the regular expression - otherwise file or path names containing a comma might be an issue.
  • Could probably do with an anchor at the beginning of the regular expressions.
  • --verbose --show-transformed-names is there to help debug the replacements.

Yes, it can be done cleaner - but this hopefully gets it working…
… again I haven’t tried this in dev.

By the way - Ruby’s Regexp.escape seems poor - it should take a 2nd parameter for the delimiter.


#20

I’m another user that would like to be able to disable compression on backups. It wastes space on my backup systems which have deduplication + compression built-in. I’m using Attic (Borg is a fork of Attic), and also my own rsnapshot-like implementation on btrfs.

When the backup job runs, I assume it does this?..

  1. Creates a temp folder
  2. Puts dump.sql, meta.json and the images etc into the temp folder
  3. Create the tarball from the temp folder
  4. Deletes the temp folder

Is this correct, or does it work differently?

If this is how it’s done, what is the path to this temp folder?

If that’s the case, then I could have my backup systems access that temp folder directly for the latest backup content (and just ignore the tarball altogether). I guess this would mean having rsync connect to the docker app filesystem rather than the host OS?

To allow us to do this, the only change to Discourse would be that instead of the temp folder being deleted immediately after the tarball is created, that either of the following happens instead:

  • delete yesterday’s backup temp folder as the first step of today’s backup job - this means the temp files are left in place for 24 hours for me to rsync
  • or maybe the temp folder doesn’t need to be deleted at all? The new backup job would just overwrite old files in the temp folder? Maybe not a good idea if images are deleted?

In either of the scenarios above, it would make sense for the temp folder to be a fixed path, rather than one that contains date/time.


(Dean Taylor) #21

Note to self:

Recently hit space issues again and backups failing so this post needs revisiting soon.

Note that in order to create a backup resulting in a 12GB gzip file I require ~36GB+ of space dedicated to backups, ~24GB+ of free space:

  • 12GB for the backup from the day before
  • 12GB for the new backup file
  • ~12GB+ for the tar archive to be compressed into the gzip file (original DB backup + original image files)

So as backup sizes increase by 1GB, the backup space requirements actually increase by ~3GB.

This assumes that you are only keeping a single previous backup - where the Discourse setting maximum backups is set to 1.

The Discourse default is 5, so using defaults I would need ~84GB dedicated to backups to allow them to work (5 × 12GB retained backups + 12GB for the new backup + ~12GB for the uncompressed tar).


(Jeff Atwood) #22

What is the lion’s share of the backup? I assume uploaded images and so on? Wouldn’t it be easier to specify the backup is database-only, and thus make it a tiny fraction of the overall size?

(Yes, we’d still need some way to back up the images independently, but at least then the urgent need for 100GB+ of space would not be present.)


(Dean Taylor) #23

Yes, the lion’s share of the backup content is “uploads”.

However a backup is not complete without the “uploads”.

My personal target is to move the images / “uploads” to Amazon S3 to avoid this issue for this specific instance; however, there is still some testing to be done on a high topic / post count instance before I can trust the migration to S3, with some issues already highlighted in that thread (more specifically, avoiding a rebake of all posts).

I have other Discourse instances that would benefit in the backups being created in a more streamlined way.


(Scott Smith) #24

I have the same problem as this thread: I have many GB of images, and while I want to migrate them to S3, from what I have read the migration script still seems a bit buggy. So I still have images locally, but am running out of disk space given the high ceiling needed to allow a backup. Even if I could delete the old backup before creating the new one, that would be OK for me. In fact I have been doing that manually.

Note that the backup system also seems to be failing me on the free disk space calculation: it will fill up the whole disk before giving up, and doesn’t even delete the partial files. Then the whole computer gets unhappy. There should be a check that skips the backup if there isn’t enough disk space for it, taking into account the space needed for compression etc.

Edit: I am going to run a cron job which will delete the (sole) local backup every day. That should solve my immediate problem, but I think it would also be nice to have an option to immediately delete any (local) backup that was already successfully copied to S3.
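
For the record, the cron entry I have in mind is nothing fancy (assuming the standard standalone backup path; adjust the retention to taste):

# remove local backup tarballs more than a day old, every morning at 04:00
0 4 * * * find /var/discourse/shared/standalone/backups/default -name '*.tar.gz' -mtime +0 -delete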