Using TarWriter to stream backups

While doing research, I found that this class is already available in Discourse via require 'rubygems/package': http://ruby-doc.org/stdlib-2.0.0/libdoc/rubygems/rdoc/Gem/Package/TarWriter.html

Using this should let Discourse take backups without needing roughly double the archive's size in free disk space, by streaming the entire archive to disk through an in-process tar and an in-process gzip.

Usage should look like the following:

require "rubygems/package"  # provides Gem::Package::TarWriter
require "zlib"

# Stream the tar through gzip straight to disk; compression level 5
# trades a little size for speed.
destination = File.open(target_filename, "wb")
gz_stream = Zlib::GzipWriter.new(destination, 5)
@tar_writer = Gem::Package::TarWriter.new(gz_stream)

log "Archiving data dump..."
FileUtils.cd(File.dirname(@dump_filename)) do
  @tar_writer.add_file "dump.sql.gz", 0644 do |tf|
    File.open(@dump_filename) do |df|
      IO.copy_stream(df, tf)
    end
  end
end

rel_directory = File.join(Rails.root, "public")
upload_directory = File.join(rel_directory, "uploads", @current_db)
log "Archiving uploads..."

# Log progress after every 100 files, or at least every 15 seconds.
last_progress = Time.now
files_since_progress = 0

Dir[File.join(upload_directory, "**/*")].each do |file|
  stat = File.stat(file)
  # Strip the prefix including its trailing slash so entries are stored as
  # relative paths ("uploads/..."), not absolute ones.
  relative = file.delete_prefix(rel_directory + "/")
  if stat.directory?
    @tar_writer.mkdir relative, stat.mode
  else
    files_since_progress += 1
    if files_since_progress > 100 || last_progress < 15.seconds.ago
      log "Archiving #{file}"
      files_since_progress = 0
      last_progress = Time.now
    end
    # Same header caveat as above: supply the size via add_file_simple.
    @tar_writer.add_file_simple relative, stat.mode, stat.size do |tf|
      File.open(file, "rb") { |df| IO.copy_stream(df, tf) }
    end
  end
end

log "Finishing up archive..."
@tar_writer.close
gz_stream.close
destination.close

remove_tmp_directory

The above code does not have:

  • proper error reporting (a sketch of ensure-based cleanup follows below)
  • progress indicators
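
Something like this would cover the cleanup side (a sketch only, using the same variables as above; archive_dump and archive_uploads are hypothetical helpers standing in for the two sections shown earlier):

begin
  archive_dump     # hypothetical helper for the dump section above
  archive_uploads  # hypothetical helper for the uploads loop above
rescue => e
  log "Backup failed: #{e.message}"
  raise
ensure
  # Close innermost first; each close flushes into the stream below it.
  @tar_writer.close
  gz_stream.close
  destination.close
  # Don't leave a half-written archive behind on failure.
  File.delete(target_filename) if $! && File.exist?(target_filename)
end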

Not sure, I thought @sam already had some ideas on this?

I am open to optimising this stuff, but we have to be careful around compatibility.

As a general approach, I would like us to do:

backup db as sql > stream to gzip > stream to tar
backup files > stream to cheapest gzip > stream to same tar

This can be done by shelling out or by using a Ruby gem.
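
To illustrate the first pipeline with a shell-out for the dump (a sketch only; the pg_dump invocation and the db_name / dump_path names are placeholders, and the compressed dump is staged in a file because the tar header needs each entry's size up front):

require "zlib"

# Stream pg_dump's output through gzip at the cheapest level (BEST_SPEED)
# so the uncompressed SQL never touches disk.
Zlib::GzipWriter.open(dump_path, Zlib::BEST_SPEED) do |gz|
  IO.popen(["pg_dump", db_name]) do |pg|
    IO.copy_stream(pg, gz)
  end
end

The file at dump_path can then be added to the tar exactly as in the snippet above.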


I think that any solution relying on tar --append is going to be wonky and prone to extra copies.

I should mention that this came up when I was thinking about a “more fully-featured user data checkout”.

The user_visits and topic_timings tables should absolutely be part of the user data checkout (see: the new EU regulations' "request access to [the personal data]"), but that means writing a zip file with multiple files in it, not just a single CSV. So I asked myself: how does the backup system do this?
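
For that, something like the rubyzip gem's streaming writer would do (a sketch; the output filename, entry names, and the *_csv variables are placeholders):

require "zip"

# Write several CSVs into a single zip without staging them on disk first.
Zip::OutputStream.open("user-archive.zip") do |zip|
  zip.put_next_entry("user_visits.csv")
  zip.write(user_visits_csv)

  zip.put_next_entry("topic_timings.csv")
  zip.write(topic_timings_csv)
end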
