Using TarWriter to stream backup

performance

(Kane York) #1

While doing research, I found that the Gem::Package::TarWriter class (Ruby 2.0.0 docs) is already available in Discourse via require 'rubygems/package'.

Using this should allow Discourse to take backups without needing more than double the required disk space, by streaming the entire archive to disk through an in-process tar and an in-process gzip.

Usage should look like the following:

require "rubygems/package"
require "zlib"

destination = File.open(target_filename, "wb")
gz_stream = Zlib::GzipWriter.new(destination, 5)
@tar_writer = Gem::Package::TarWriter.new(gz_stream)

log "Archiving data dump..."
# GzipWriter cannot seek, so use add_file_simple (which takes the size up
# front) rather than add_file (which seeks back to rewrite the header).
@tar_writer.add_file_simple "dump.sql.gz", 0644, File.size(@dump_filename) do |tf|
  File.open(@dump_filename, "rb") do |df|
    IO.copy_stream(df, tf)
  end
end

rel_directory = File.join(Rails.root, "public")
upload_directory = File.join(rel_directory, "uploads", @current_db)
log "Archiving uploads..."

last_progress = Time.now
files_since_progress = 0

Dir[File.join(upload_directory, "**/*")].each do |file|
  stat = File.stat(file)
  # store entries as "uploads/..." rather than with a leading slash
  relative = file.delete_prefix(rel_directory + "/")
  if stat.directory?
    @tar_writer.mkdir relative, stat.mode
  else
    files_since_progress += 1
    # throttle the log output to every 100 files or every 15 seconds
    if files_since_progress > 100 || last_progress < 15.seconds.ago
      log "Archiving #{file}"
      files_since_progress = 0
      last_progress = Time.now
    end
    # again, add_file_simple because the gzip stream is not seekable
    @tar_writer.add_file_simple relative, stat.mode, stat.size do |tf|
      File.open(file, "rb") { |df| IO.copy_stream(df, tf) }
    end
  end
end

log "Finishing up archive..."
@tar_writer.close
gz_stream.close
destination.close

remove_tmp_directory

The above code still lacks:

  • proper error reporting (a rough sketch of that is below)
  • progress indicators beyond the throttled log lines above
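
For the error reporting part, I imagine wrapping the whole thing roughly like this (untested sketch, using the same variables and log helper as above):

begin
  destination = File.open(target_filename, "wb")
  gz_stream = Zlib::GzipWriter.new(destination, 5)
  @tar_writer = Gem::Package::TarWriter.new(gz_stream)

  # ... archive the dump and the uploads as above
  #     (the explicit close calls at the end move into the ensure block) ...
rescue => e
  # report the failure and don't leave a half-written archive behind
  log "Backup archiving failed: #{e.class}: #{e.message}"
  File.delete(target_filename) if File.exist?(target_filename)
  raise
ensure
  # GzipWriter buffers data, so the streams must be closed even on failure
  @tar_writer&.close
  gz_stream&.close
  destination&.close
end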

(Jeff Atwood) #2

Not sure, I thought @sam already had some ideas on this?


(Sam Saffron) #3

I am open to optimising this stuff, but we have to be careful around compatibility.

As a general approach I would like us to:

backup db as sql > stream to gzip > stream to tar
backup files > stream to cheapest gzip > stream to same tar

This can be done by shelling out or by using a Ruby gem.
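
For the shell-out route I am picturing something roughly like this (database name, paths and gzip levels are only illustrative):

db = "discourse_db" # illustrative
# dump straight into gzip so the raw SQL never hits disk
system("pg_dump #{db} | gzip -6 > dump.sql.gz") || raise("pg_dump failed")
# start the tar with the dump, then append the uploads
system("tar -cf backup.tar dump.sql.gz")        || raise("tar create failed")
system("tar -rf backup.tar -C public uploads")  || raise("tar append failed")
# cheapest/fastest gzip pass over the whole archive
system("gzip -1 backup.tar")                    || raise("gzip failed")
# note: the pipeline's exit status is gzip's; a real version would need
# something like `set -o pipefail` to catch pg_dump failures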


(Kane York) #4

I think that any solution relying on tar --append is going to be wonky and prone to extra copies.

I should mention that this came up when I was thinking about a “more fully-featured user data checkout”.

The user_visits and topic_timings tables should absolutely be part of the user data checkout (see the new EU regulations: "request access to [the personal data]"), but that means writing a zip file with multiple files in it, not just a single CSV. So I thought: how does the backup system do this?
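
If the answer ends up being "the same way, but into a zip", the shape of it might be something like this (very rough sketch; assumes the rubyzip gem is available, and the model names are only illustrative):

require "csv"
require "zip" # rubyzip, assumed to be available

# Rough sketch: stream several per-user CSVs into a single zip file.
def write_user_checkout(path, user)
  Zip::OutputStream.open(path) do |zip|
    {
      "user_visits.csv"   => UserVisit,
      "topic_timings.csv" => TopicTiming, # illustrative model names
    }.each do |name, model|
      zip.put_next_entry(name)
      zip.write CSV.generate_line(model.column_names)
      model.where(user_id: user.id).find_each do |row|
        zip.write CSV.generate_line(row.attributes.values_at(*model.column_names))
      end
    end
  end
end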