UpdateGravatar job when the gravatar hasn't changed but the upload file is inaccessible

I’ve been racking my brain trying to work out why I’ve got lots of “generic head” avatars while the UpdateGravatar job was not updating the UserAvatar, yet not logging ANYTHING… I managed to track it down to:

        if tempfile
          upload = UploadCreator.new(tempfile, 'gravatar.png', origin: gravatar_url, type: "avatar").create_for(user_id)

          if gravatar_upload_id != upload.id
            gravatar_upload&.destroy!
            self.gravatar_upload = upload
            save!
          end
        end

UploadCreator returns the existing upload record when the gravatar hasn’t changed, so the id comparison matches and we just abort.
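For context, the dedup is sha1-based; a simplified sketch of that behaviour (illustrative only, not the actual Discourse source):

    require "digest"

    # Simplified sketch (not the real UploadCreator). The re-downloaded gravatar
    # hashes to the same sha1 as the record already in the database, so the
    # existing Upload is returned untouched; nothing checks that its url still
    # points at a readable file in the current store.
    def find_or_create_upload(tempfile)
      sha1 = Digest::SHA1.file(tempfile.path).hexdigest
      existing = Upload.find_by(sha1: sha1)
      return existing if existing # early return: a stale local url survives here
      # ... otherwise store the file and create a fresh Upload record ...
    end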

BUT:

[46] pry(main)> Upload.find(3)
=> #<Upload:0x0000561fc5b5b8d8
 id: 3,
 user_id: 1,
 original_filename: "gravatar.png",
 filesize: 166131,
 width: 360,
 height: 360,
 url: "/uploads/default/original/1X/0430db64b0848c22047072b62e52bf733b105010.png",
 created_at: Wed, 01 Aug 2018 19:06:17 UTC +00:00,
 updated_at: Wed, 01 Aug 2018 19:06:17 UTC +00:00,
 sha1: "0430db64b0848c22047072b62e52bf733b105010",
 origin: "https://www.gravatar.com/avatar/e8b7922236865bfc10fb22eb13f9c394.png?s=360&d=404",
 retain_hours: nil,

This looks all well and good until you realize I’m working in the :cloud: but the upload record is pointing at the local filesystem instead of the :cloud:. So the whole time, Discourse can’t read the file, yet the update job won’t put a new one into place.

As for how it got into this state in the first place, that’s something else I’m investigating.

Perhaps nothing needs to change in the UpdateGravatar job itself; perhaps something should be done to recover from this state when retrieving the user avatar (e.g. invalidating the upload)… or maybe this is an UploadCreator bug and it should verify that the existing upload is accessible rather than assuming it is.

I bet that would fix… Other things :slight_smile:

I think a big pile of pain here can be removed once we run the migrate-local-to-the-cloud job.

I am pretty sure @tgxworld rewrote a lot of the process to avoid this particular failure case. Can we close this @tgxworld ?

By :cloud:, do you mean that you have S3 uploads enabled yet the upload is still being stored on the local file system? If that is the case, S3 uploads might not have been configured correctly, since the URL is generated based on the store Discourse is currently configured with.

In this particular case it was Azure, but yeah same idea.

I took a backup from a metal site, restored it to an Azure-enabled site, then had to run the uploads:migrate rake task.

I think I was running into this before running the rake task, but I can easily recreate the scenario.

The right answer is probably “UploadCreator should just re-store the file if it’s being uploaded again but can’t be accessed.”

I think I know what is happening here. When you upload the same upload again, the sha1 is the same so UploadCreator will just return the existing upload record that matches the sha1. The challenge here is how we’re going to determine that the store has changed so that we can decide if we need to re-store the file.
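One rough way to approximate “the store has changed”, purely as an illustration (the local-prefix check and the config flag are assumptions, not existing Discourse API):

    # Illustrative only: guess whether an existing upload's url belongs to a
    # different store than the one currently in use. `external_store_enabled`
    # and `local_prefix` would come from site configuration.
    def stored_elsewhere?(upload, external_store_enabled:, local_prefix: "/uploads/")
      looks_local = upload.url.start_with?(local_prefix)
      external_store_enabled ? looks_local : !looks_local
    end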

@sam Do we want this to go into the 2.1 release or is after OK?

Yep, exactly.

I don’t think we need to determine that; we could just try accessing the existing path. If we can’t access the file, re-store it under the current store’s path.
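Something like this minimal sketch, assuming the standard public-directory layout for local uploads (`re_store!` is a placeholder, not a real Discourse method):

    # Sketch of the "just try to read it" approach for locally stored uploads.
    def ensure_readable!(upload, public_dir: Rails.public_path.to_s)
      return upload unless upload.url.start_with?("/uploads/") # external urls need a different check
      path = File.join(public_dir, upload.url)
      return upload if File.exist?(path)
      re_store!(upload) # placeholder: push the freshly downloaded file into the current store
    end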

I think the correct thing to do here is to make sure we have a proper rake task to fix this stuff. If you are using S3 and somehow hit a situation where some uploads are both local and missing, we should have a task to clean up the mess. The simplest thing to do here would be a task for “clean up local” which nukes all local uploads that are missing.
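A minimal sketch of what such a task might look like (the task name, the DRY_RUN flag, and the public-path layout are all assumptions, not an existing task):

    # Sketch only: destroy upload records whose url points at a local file
    # that no longer exists. Run with DRY_RUN=1 to preview.
    desc "Destroy upload records whose local file is missing"
    task "uploads:destroy_missing_local" => :environment do
      dry_run = ENV["DRY_RUN"] == "1"

      Upload.where("url LIKE '/uploads/%'").find_each do |upload|
        path = File.join(Rails.public_path.to_s, upload.url)
        next if File.exist?(path)

        puts "#{dry_run ? 'would destroy' : 'destroying'} upload #{upload.id} (#{upload.url})"
        upload.destroy! unless dry_run
      end
    end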

Any rake tasks we want are totally fine for 2.1

We already have the rake task to migrate local uploads to S3. The problem @supermathie is describing is that he feels UploadCreator should re-upload the file to the store if it detects that the current Upload#url is no longer valid.

Yeah BUT … we do not have a rake task for bin/rake nuke_bad_local_uploads_from_orbit (and a dry run for that as well) … this is what I want us to run here on sites exhibiting this issue.

We also need to fix the migrate-to-S3 job so it deals more cleanly with the “oh no, this file is missing from local” case.

I don’t think it is a bad local file in this case. On the :cloud: the uploads aren’t shared across instances, so depending on which instance the update gravatar job runs on, the result of trying to read a local file is different.

Oh, they absolutely are: the gravatars are all stored on S3. Once you shift to S3, everything must be on S3, always, under all conditions.

Technically this is an anti-pattern:

When on S3 we should always be 1000% on S3… not S3 for some files and local disk for others.

The Gravatar job uses UploadCreator, which stores stuff on S3.

That is not what @supermathie is describing/requesting. Based on what I’m reading here, the feature being requested is that UploadCreator should re-store the file if the store for the upload has changed, regardless of whether all uploads have already been migrated properly to the new store.

Yeah, I am pushing back here for now. I would prefer to first have a rake task that “cleans up any giant mess” for 2.1.

For 2.2 we can refactor; perhaps we add a storage-engine column to Upload and OptimizedImage and then build some extra resiliency into these processes, but it is a far bigger change.
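A first step for that refactor would presumably be a migration along these lines (purely illustrative, not a committed design):

    # Illustrative only; not a committed schema change.
    class AddStorageEngineToUploads < ActiveRecord::Migration[5.2]
      def change
        add_column :uploads, :storage_engine, :string
        add_column :optimized_images, :storage_engine, :string
      end
    end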

For now, all I want is a big-mess-cleanup rake task.

There is an open question around when/how we validate that our S3 pointers are still good and not 404ing. But that is also a hard question to answer.

There are other things we could potentially do:

  • when we try to access an upload but fail (e.g. the avatar), we could mark it as known-bad
  • (by the way, when we send back the “default” avatar we should probably set cache headers to only cache for, say, 1 hour; we should do that)

This all started with me taking a metal backup and restoring it onto an Azure site, so let me do that again, do a proper uploads:migrate, and see how it looks after that.

Never mind… we already do the right thing (of course we do! we know what we’re doing!). Should have checked.

a “default image” result looks like:

HTTP/1.1 200 OK
Cache-Control: max-age=600, public
Content-Length: 2608
Content-Type: image/png
Last-Modified: Fri, 31 Aug 2018 03:48:45 GMT (current time)

vs.

a good result:

HTTP/1.1 200 OK
Cache-Control: max-age=31556952, public, immutable
Content-Length: 1478
Content-Type: image/png
Last-Modified: Fri, 31 Aug 2018 03:30:06 GMT

A post was split to a new topic: Do not set current time as ‘Last Modified’ when “not found” avatar is used

@supermathie I’m going to close this for now. I don’t think it is the application’s job to determine if the store has changed. If the site has been moved to an S3 store, there shouldn’t be any upload records that contain a URL that belongs to a different store.
