Using TarWriter to stream backup

performance

(Kane York) #1

While doing research, I found that the Gem::Package::TarWriter class (Ruby 2.0.0 docs) is already available in Discourse via require 'rubygems/package'.

Using this should allow Discourse to take backups without needing more than double the required disk space, by streaming the entire archive to disk through an in-process tar and an in-process gzip.

Usage should look like the following:

require "rubygems/package"
require "zlib"

destination = File.open(target_filename, "wb")
gz_stream = Zlib::GzipWriter.new(destination, 5)
@tar_writer = Gem::Package::TarWriter.new(gz_stream)

log "Archiving data dump..."
# GzipWriter cannot seek, so use add_file_simple (which takes the size up
# front) rather than add_file (which seeks back to rewrite the header).
@tar_writer.add_file_simple "dump.sql.gz", 0644, File.size(@dump_filename) do |tf|
  File.open(@dump_filename, "rb") do |df|
    IO.copy_stream(df, tf)
  end
end

rel_directory = File.join(Rails.root, "public")
upload_directory = File.join(rel_directory, "uploads", @current_db)
log "Archiving uploads..."

last_progress = Time.now
files_since_progress = 0

Dir[File.join(upload_directory, "**/*")].each do |file|
  stat = File.stat(file)
  # store entries as "uploads/..." rather than with a leading slash
  relative = file.delete_prefix(rel_directory + "/")
  if stat.directory?
    @tar_writer.mkdir relative, stat.mode
  else
    files_since_progress += 1
    # throttle the log output to every 100 files or every 15 seconds
    if files_since_progress > 100 || last_progress < 15.seconds.ago
      log "Archiving #{file}"
      files_since_progress = 0
      last_progress = Time.now
    end
    # again, add_file_simple because the gzip stream is not seekable
    @tar_writer.add_file_simple relative, stat.mode, stat.size do |tf|
      File.open(file, "rb") { |df| IO.copy_stream(df, tf) }
    end
  end
end

log "Finishing up archive..."
@tar_writer.close
gz_stream.close
destination.close

remove_tmp_directory

The above code still lacks:

  • proper error reporting (a rough sketch of that is below)
  • progress indicators beyond the throttled log lines above
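
For the error reporting part, I imagine wrapping the whole thing roughly like this (untested sketch, using the same variables and log helper as above):

begin
  destination = File.open(target_filename, "wb")
  gz_stream = Zlib::GzipWriter.new(destination, 5)
  @tar_writer = Gem::Package::TarWriter.new(gz_stream)

  # ... archive the dump and the uploads as above
  #     (the explicit close calls at the end move into the ensure block) ...
rescue => e
  # report the failure and don't leave a half-written archive behind
  log "Backup archiving failed: #{e.class}: #{e.message}"
  File.delete(target_filename) if File.exist?(target_filename)
  raise
ensure
  # GzipWriter buffers data, so the streams must be closed even on failure
  @tar_writer&.close
  gz_stream&.close
  destination&.close
end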

(Jeff Atwood) #2

Not sure, I thought @sam already had some ideas on this?


(Sam Saffron) #3

I am open to optimising this stuff, but we have to be careful around compatibility.

As a general approach I would like us to:

backup db as sql > stream to gzip > stream to tar
backup files > stream to cheapest gzip > stream to same tar

This can be done by shelling out or by using a Ruby gem.
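
For the shell-out route I am picturing something roughly like this (database name, paths and gzip levels are only illustrative):

db = "discourse_db" # illustrative
# dump straight into gzip so the raw SQL never hits disk
system("pg_dump #{db} | gzip -6 > dump.sql.gz") || raise("pg_dump failed")
# start the tar with the dump, then append the uploads
system("tar -cf backup.tar dump.sql.gz")        || raise("tar create failed")
system("tar -rf backup.tar -C public uploads")  || raise("tar append failed")
# cheapest/fastest gzip pass over the whole archive
system("gzip -1 backup.tar")                    || raise("gzip failed")
# note: the pipeline's exit status is gzip's; a real version would need
# something like `set -o pipefail` to catch pg_dump failures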


(Kane York) #4

I think that any solution relying on tar --append is going to be wonky and prone to extra copies.

I should mention that this came up when I was thinking about a “more fully-featured user data checkout”.

The user_visits and topic_timings tables should absolutely be part of the user data checkout (see the new EU regulations: "request access to [the personal data]"), but that means writing a zip file with multiple files in it, not just a single CSV. So I thought: how does the backup system do this?
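
If the answer ends up being "the same way, but into a zip", the shape of it might be something like this (very rough sketch; assumes the rubyzip gem is available, and the model names are only illustrative):

require "csv"
require "zip" # rubyzip, assumed to be available

# Rough sketch: stream several per-user CSVs into a single zip file.
def write_user_checkout(path, user)
  Zip::OutputStream.open(path) do |zip|
    {
      "user_visits.csv"   => UserVisit,
      "topic_timings.csv" => TopicTiming, # illustrative model names
    }.each do |name, model|
      zip.put_next_entry(name)
      zip.write CSV.generate_line(model.column_names)
      model.where(user_id: user.id).find_each do |row|
        zip.write CSV.generate_line(row.attributes.values_at(*model.column_names))
      end
    end
  end
end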