How to pull images from a bucket I don't own

I have a site with a bunch of images on a CDCK bucket. I have only a little time left to pull them off of that bucket and move them to the local server (and then move them to another bucket).

The uploads:analyze_missing_s3 and uploads:fix_missing_s3 rake tasks don’t find any problems, even though ~25K images are on a distant bucket.

Feature suggestion: it seems like EnsureS3UploadsExistence should mark uploads as invalid_etag when they live on a bucket the site doesn't have permissions for, but it does not. I think it would make sense for this or some other job to notice when uploads are on a bucket that the current site does not control, or that simply isn't the expected bucket for the site.
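
As a stopgap, something along these lines in the Rails console could at least surface those uploads (a rough sketch; I'm assuming Discourse.store.absolute_base_url is the configured bucket's base URL, so double-check that on your install):

# rough sketch: list uploads whose URL does not start with the base URL of
# the bucket this site is actually configured to use
expected_base = Discourse.store.absolute_base_url   # e.g. "//my-bucket.s3.dualstack.us-east-1.amazonaws.com"
foreign = Upload.where("url LIKE '//%'").where.not("url LIKE ?", "#{expected_base}%")
puts "#{foreign.count} uploads appear to be on a bucket this site does not control"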

I can’t quite tell what verification_status is, as it doesn’t seem to verify that anything is actually good.
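
For what it's worth, the enum is visible from the Rails console; my reading of the source is that the values are unchecked, verified, and invalid_etag, but verify against your version:

# shows the verification_status enum (I believe: unchecked: 1, verified: 2, invalid_etag: 3)
Upload.verification_statuses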

What I tried

ups = Upload.where("url LIKE '//discourse-cloud-file-uploads.s3.dualstack.us-west-2.amazonaws.com/business6%'")
ups.update_all(verification_status: 3)

and then

 rake uploads:fix_missing_s3 

That rake task printed messages about downloading images, but a day later, when the posts were rebaked, those images still looked like ![zawfI7C|346x500](upload://9L0PqY4QpqLOfexXHMMbv00EgaB.jpeg) in raw but linked to https://test.literatecomputing.com/images/transparent.png.


Maybe an ad hoc script where you:

  1. List the files
  2. wget that list
  3. aws s3 sync them over to your bucket
  4. remap oldbucket newbucket (a rough sketch of this step is below)
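
For step 4, the remap could be done with the posts:remap rake task, or from the Rails console; a minimal sketch, assuming a hypothetical new bucket URL:

# hypothetical remap of the old bucket prefix to the new one (step 4 above);
# new_base is an assumption -- substitute your actual new bucket's base URL
old_base = "//discourse-cloud-file-uploads.s3.dualstack.us-west-2.amazonaws.com/business6/uploads/forumosa"
new_base = "//my-new-bucket.s3.us-west-2.amazonaws.com/uploads/forumosa"
DbHelper.remap(old_base, new_base)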

Well, bother. I was afraid that might be the answer.

And I guess that’s the way that @RGJ does it too. :crying_cat_face:


But I can’t list the files because they are on your bucket, and I’m pretty sure I need credentials to list it.
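
Strictly speaking, a bucket listing may not be needed if the keys are already recorded in the uploads table; a rough sketch (assuming Upload#url still holds the old-bucket URLs, as in the query earlier) that writes a list wget can consume:

# rough sketch: write one absolute URL per line, suitable for `wget -i /tmp/url_list.txt`
bad_host = "discourse-cloud-file-uploads.s3.dualstack.us-west-2.amazonaws.com"
File.open("/tmp/url_list.txt", "w") do |f|
  Upload.where("url LIKE ?", "//#{bad_host}%").find_each do |upload|
    f.puts("https:#{upload.url}")
  end
end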

rake uploads:fix_missing_s3 seems to have pulled (most?) things to the local filesystem (uploads are not yet on S3 for this site).

So I did this to fix up the uploads:

# run from the Rails console
def fix_bad_uploads(bad_uploads)
  fixed = 0
  retrieved = 0
  missing = 0
  bad_bucket = "//discourse-cloud-file-uploads.s3.dualstack.us-west-2.amazonaws.com/business6/uploads/forumosa"
  bad_uploads.each do |upload|
    # keep the original (remote) URL for downloading, then rewrite the
    # record to point at the local uploads path
    url = URI.parse("https:" + upload.url)
    upload.url = upload.url.gsub(bad_bucket, "/uploads/default")
    if File.exist?("/shared/#{upload.url}")
      # file is already on the local filesystem; just fix the record
      fixed += 1
      print "1"
      upload.save
      # posts = Post.where("raw like '%#{upload.short_url}%'")
      # posts.each do |post|
      #   post.rebake!
      #   print "."
      # end
    else
      begin
        # retrieve the missing file from the old bucket
        filename = "/shared#{upload.url}"
        dirname = File.dirname(filename)
        FileUtils.mkdir_p(dirname) unless File.directory?(dirname)
        Net::HTTP.start(url.host, url.port, use_ssl: true) do |http|
          resp = http.get(url.path)
          File.open(filename, "wb") { |file| file.write(resp.body) }
        end
        retrieved += 1
        print "+"
        upload.save if File.exist?(filename)
      rescue => e
        puts "bad: #{e}"
        missing += 1
        sleep 1
        print "0"
      end
    end
  end
  puts "\nfixed: #{fixed}, retrieved: #{retrieved}, missing: #{missing}"
end
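
I called it with the same relation as the query above:

bad_uploads = Upload.where("url LIKE '//discourse-cloud-file-uploads.s3.dualstack.us-west-2.amazonaws.com/business6%'")
fix_bad_uploads(bad_uploads)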

This fixed up most of them. But there seem to be some posts that have an upload:// entry for which there isn’t an Upload in the database. Rebaking those ends up with a transparent.png.

So then I tried something like this:

def get_missing_short_url(short_url)
  prefix = "https://discourse-cloud-file-uploads.s3.dualstack.us-west-2.amazonaws.com/business6/uploads/forumosa/original/3X"
  sha1 = Upload.sha1_from_short_url(short_url)
  extension = short_url.split(".").last
  upload = Upload.find_by(sha1: sha1)
  if !upload
    # no Upload record -- try to find the file on the old bucket by
    # reconstructing the 3X path from the first two characters of the sha1
    one = sha1[0]
    two = sha1[1]
    url_link = "#{prefix}/#{one}/#{two}/#{sha1}.#{extension}"
    puts "URL: #{url_link}"
    sleep 1
    url = URI.parse(url_link)
    filename = "/tmp/#{File.basename(url_link)}"
    dirname = File.dirname(filename)
    FileUtils.mkdir_p(dirname) unless File.directory?(dirname)
    # download the file to /tmp
    Net::HTTP.start(url.host, url.port, use_ssl: true) do |http|
      resp = http.get(url.path)
      File.open(filename, "wb") { |file| file.write(resp.body) }
    end
    # create an Upload record for the downloaded file
    File.open(filename, "r") do |file|
      upload = UploadCreator.new(
        file,
        File.basename(file),
      ).create_for(Discourse.system_user.id)
    end
    if upload.persisted?
      puts "We did it! #{upload.id}"
    else
      puts "darn. #{upload.errors.full_messages}"
      sleep 5
    end
  end
  upload
end

That mostly works, but in my tests sometimes I fail to infer the correct S3 URL from the sha that I infer from the short URL. I’m not sure how to fix that.
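
One thing that might help with the URL guessing (an untested sketch; the 1X/2X/3X layouts are my assumption about how the path depth is built, so verify against a known-good upload): probe each layout and keep the first one the bucket answers with a success status.

# untested sketch: try the 1X/2X/3X path layouts in turn and return the first
# URL that answers with a 2xx (the NX segment seems to control how many
# single-character sha1 subdirectories follow it)
require "net/http"

def find_remote_url(sha1, extension)
  base = "https://discourse-cloud-file-uploads.s3.dualstack.us-west-2.amazonaws.com/business6/uploads/forumosa/original"
  candidates = [
    "#{base}/1X/#{sha1}.#{extension}",
    "#{base}/2X/#{sha1[0]}/#{sha1}.#{extension}",
    "#{base}/3X/#{sha1[0]}/#{sha1[1]}/#{sha1}.#{extension}",
  ]
  candidates.find do |candidate|
    uri = URI.parse(candidate)
    Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
      http.head(uri.path).is_a?(Net::HTTPSuccess)
    end
  end
end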

Also, one of them somehow ended up with a sha1 that was different from the one in the filename of the S3 path.

My current thinking is to start by going through all of the cooked posts, collecting all of the https://discourse-cloud-file-uploads URLs, and then updating the Upload records that refer to them and creating the ones that are missing.
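
A starting point might look like this (untested sketch; it assumes the sha1 is recoverable from the filename in the URL, which, as noted above, has not always held for me):

# rough sketch: collect every old-bucket URL still present in cooked posts,
# then report the ones with no matching Upload record
bad_host = "discourse-cloud-file-uploads.s3.dualstack.us-west-2.amazonaws.com"
urls = []
Post.where("cooked LIKE ?", "%#{bad_host}%").find_each do |post|
  urls.concat(post.cooked.scan(%r{https?://#{Regexp.escape(bad_host)}[^"'\s)]+}))
end
urls.uniq.each do |u|
  sha1 = File.basename(u, ".*")
  puts "no Upload record for #{u}" unless Upload.exists?(sha1: sha1)
end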

Am I missing something obvious?

Isn’t the uploads table a list of files?

That’s what I thought! But there are some upload:// references in raw that do not have entries in the uploads table (at least when I search for the sha1 returned by Upload.sha1_from_short_url(short_url)).

Most of the time, but not always, I’ve been able to infer the bucket URL from the sha1 (I don’t quite understand 1X, 2X, 3X, but it seems they’re all in 3X).

So, no, it is not the case that the uploads table is a complete list of the files in question.