But I can’t list the files, because they are in your bucket and I’m pretty sure I need credentials to get a listing.
rake uploads:fix_missing_s3
seems to have pulled (most?) things to the local filesystem (uploads are not yet on S3 for this site).
So I did this to fix up the uploads:
def fix_bad_uploads(bad_uploads)
  fixed = 0
  retrieved = 0
  missing = 0
  bad_bucket = "//discourse-cloud-file-uploads.s3.dualstack.us-west-2.amazonaws.com/business6/uploads/forumosa"
  bad_uploads.each do |upload|
    url = URI.parse("https:" + upload.url)
    upload.url = upload.url.gsub(bad_bucket, "/uploads/default")
    if File.exist?("/shared#{upload.url}")
      # file is already on the local filesystem; just fix the record
      fixed += 1
      print "1"
      upload.save
      # posts = Post.where("raw like '%#{upload.short_url}%'")
      # posts.each do |post|
      #   post.rebake!
      #   print "."
      # end
    else
      begin
        # retrieve the missing file from the old bucket
        filename = "/shared#{upload.url}"
        dirname = File.dirname(filename)
        FileUtils.mkdir_p(dirname) unless File.directory?(dirname)
        Net::HTTP.start(url.host, url.port, use_ssl: true) do |http|
          resp = http.get(url.path)
          File.open(filename, "wb") do |file|
            file.write(resp.body)
          end
        end
        retrieved += 1
        print "+"
        upload.save if File.exist?(filename)
      rescue => e
        puts "bad: #{e}"
        missing += 1
        sleep 1
        print "0"
      end
    end
  end
end
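For context, the bad_uploads passed in can be collected with something like this (the LIKE pattern is just an assumption — any Upload whose url still points at the old bucket):

bad_uploads = Upload.where("url LIKE ?", "%discourse-cloud-file-uploads.s3.dualstack.us-west-2.amazonaws.com%")
fix_bad_uploads(bad_uploads)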
This fixed up most of them. But there seem to be some posts that have an upload:// entry for which there isn’t an Upload record in the database. Rebaking those ends up with a transparent.png.
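To see which ones those are, a rough sketch like this should list the upload:// short URLs in raw that have no matching Upload record (the regex is an assumption about the short URL format):

short_urls = Post.where("raw LIKE '%upload://%'").find_each.flat_map do |post|
  post.raw.scan(%r{upload://[A-Za-z0-9]+\.[A-Za-z0-9]+})
end.uniq

orphaned = short_urls.select do |short_url|
  sha1 = Upload.sha1_from_short_url(short_url)
  sha1.nil? || Upload.find_by(sha1: sha1).nil?
end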
So then I tried something like this:
def get_missing_short_url(short_url)
  prefix = "https://discourse-cloud-file-uploads.s3.dualstack.us-west-2.amazonaws.com/business6/uploads/forumosa/original/3X"
  remove_url = "https://discourse-cloud-file-uploads.s3.dualstack.us-west-2.amazonaws.com/business6/uploads/forumosa/"
  sha1 = Upload.sha1_from_short_url(short_url)
  extension = short_url.split(".").last
  upload = Upload.find_by(sha1: sha1)
  if !upload
    # no Upload record -- try to find the file in s3
    one = sha1[0]
    two = sha1[1]
    url_link = "#{prefix}/#{one}/#{two}/#{sha1}.#{extension}"
    puts "URL: #{url_link}"
    sleep 1
    url = URI.parse(url_link)
    filename = "/tmp/#{File.basename(url_link.gsub(remove_url, "/shared/uploads/default/"))}"
    dirname = File.dirname(filename)
    FileUtils.mkdir_p(dirname) unless File.directory?(dirname)
    # download the file from the old bucket
    File.open(filename, "wb") do |file|
      Net::HTTP.start(url.host, url.port, use_ssl: true) do |http|
        resp = http.get(url.path)
        file.write(resp.body)
      end
    end
    # make an Upload record for the file
    File.open(filename, "rb") do |file|
      upload = UploadCreator.new(
        file,
        File.basename(filename),
      ).create_for(Discourse.system_user.id)
    end
    if upload.persisted?
      puts "We did it! #{upload.id}"
    else
      puts "darn. #{upload.errors.full_messages}"
      sleep 5
    end
  end
  upload
end
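And then something like this drives it, assuming the orphaned list from the sketch above, and rebakes the affected posts afterwards:

orphaned.each do |short_url|
  upload = get_missing_short_url(short_url)
  next unless upload&.persisted?
  Post.where("raw LIKE ?", "%#{short_url}%").find_each(&:rebake!)
end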
That mostly works, but in my tests I sometimes fail to infer the correct S3 URL from the sha1 that I infer from the short URL, and I’m not sure how to fix that. Also, one of them somehow ended up with a sha1 that was different from the one in the filename of the S3 path.
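A check like this inside get_missing_short_url, right after the download (plain Digest::SHA1, nothing Discourse-specific), would at least surface those mismatches instead of quietly creating an Upload under a different sha1:

require "digest"

expected_sha1 = Upload.sha1_from_short_url(short_url)
actual_sha1 = Digest::SHA1.file(filename).hexdigest
if actual_sha1 != expected_sha1
  puts "sha1 mismatch for #{short_url}: expected #{expected_sha1}, got #{actual_sha1}"
end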
My current thinking now is to start by going through all of the cooked posts, getting all of the https://discourse-cloud-file-uploads URLs, and then updating the Upload records that refer to them and creating the ones that are missing.
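Something like this for the first step (a sketch; the host and the regex are just based on the URLs above):

bad_host = "discourse-cloud-file-uploads.s3.dualstack.us-west-2.amazonaws.com"
bad_urls = []
Post.where("cooked LIKE ?", "%#{bad_host}%").find_each do |post|
  bad_urls.concat(post.cooked.scan(%r{https://#{Regexp.escape(bad_host)}[^"'\s)]+}))
end
bad_urls.uniq!
puts "#{bad_urls.size} distinct old-bucket URLs still referenced in cooked"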
Am I missing something obvious?