Using TarWriter to stream backup


(Kane York) #1

While doing research, I found that this class is already available in Discourse via require 'rubygems/package': Class: Gem::Package::TarWriter (Ruby 2.0.0)

Using this should allow Discourse to take backups without having more than double the required disk space available, by streaming the entire archive to disk through an in-process tar and an in-process gzip.

Usage should look like the following:

destination =, "wb") # archive_path: wherever the finished .tar.gz should land
gz_stream =, 5)
@tar_writer =

log "Archiving data dump..."
@tar_writer.add_file "dump.sql.gz", 0644 do |tf|
  # dump_gz_path: the already-gzipped pg_dump output, "rb") do |df|
    IO.copy_stream(df, tf)

rel_directory = File.join(Rails.root, "public")
upload_directory = File.join(rel_directory, "uploads", @current_db)
log "Archiving uploads..."

last_progress =
files_since_progress = 0

Dir[File.join(upload_directory, "**/*")].each do |file|
  stat = File.stat(file)
  relative = file.delete_prefix(rel_directory)
    @tar_writer.mkdir relative, stat.mode
  elsif stat.file?
    files_since_progress += 1
    if files_since_progress > 100 or (last_progress < 15.seconds.ago)
      log "Archiving #{file}"
      files_since_progress = 0
      last_progress =
    @tar_writer.add_file relative, stat.mode do |tf|, "rb") { |df| IO.copy_stream(df, tf) }

log "Finishing up archive..."


The above code does not have:

  • proper error reporting
  • progress indicators
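On the error-reporting point, a minimal first step would be wrapping the writers in begin/rescue/ensure so the tar and gzip streams always get flushed and a half-written archive is removed on failure. A sketch, with the archive path invented and `warn` standing in for the real backup logger:

```ruby
require "rubygems/package"
require "zlib"

archive_path = "/tmp/backup_sketch.tar.gz" # invented path for illustration

destination =, "wb")
gz_stream =, 5)
tar_writer =

  # Real code would stream the dump and uploads here.
  tar_writer.add_file_simple("hello.txt", 0644, 5) { |tf| tf.write("hello") }
rescue => e
  warn "Backup failed: #{e.class}: #{e.message}"
  File.delete(archive_path) if File.exist?(archive_path)
  raise
  # Close innermost-first so the tar EOF blocks and gzip trailer are flushed.
  tar_writer.close
  gz_stream.close unless gz_stream.closed?
  destination.close unless destination.closed?

`add_file_simple` is used here because it takes the size up front and never seeks, which matters when the underlying stream is a gzip pipe.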

(Jeff Atwood) #2

Not sure, I thought @sam already had some ideas on this?

(Sam Saffron) #3

I am open to optimising this stuff, but we have to be careful around compatibility.

As a general approach I would like us to do:

backup db as sql > stream to gzip > stream to tar
backup files > stream to cheapest gzip > stream to same tar

This can be done by shelling out or by using a Ruby gem
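Done in-process with the stdlib classes from post #1, that two-stream layout could be sketched like this (both members land in the same tar, with the dump at a normal gzip level and uploads at the cheapest; all names and contents below are invented):

```ruby
require "rubygems/package"
require "zlib"
require "stringio"

# Gzip a string in memory at the given compression level.
def gzip(data, level)
  io =
  gz =, level)
  gz.close # finishes the gzip trailer
  io.string

tar_io =
tar =

# Database dump: normal compression level.
dump = gzip("SELECT 1;\n" * 1_000, 5)
tar.add_file_simple("dump.sql.gz", 0644, dump.bytesize) { |tf| tf.write(dump) }

# Uploads: cheapest gzip, since images barely compress anyway.
upload = gzip("fake image bytes " * 100, Zlib::BEST_SPEED)
tar.add_file_simple("uploads/avatar.png.gz", 0644, upload.bytesize) { |tf| tf.write(upload) }

tar.close # writes the tar EOF blocks; tar_io now holds the whole archive

The shelling-out variant would be the same shape with `pg_dump | gzip` and `tar` processes in place of the in-process streams.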

(Kane York) #4

I think that any solution relying on tar --append is going to be wonky and prone to extra copies.

I should mention that this came up when I was thinking about a “more fully-featured user data checkout”.

The user_visits and topic_timings tables should absolutely be part of the user data checkout (see: new EU regulations, “request access to [the personal data]”), but that means writing a zip file with multiple files in it, not just a single csv. So I thought about, “how does the backup system do this?”