Launcher rebuild fails if s3 settings aren't correct

One of my sites won’t come up after a recent update.

It’s a DigitalOcean droplet set up via the standard Docker install, and it has been up and running for about 1.5 years.

Today it failed to restart. The rebuild attempt below was run after an apt-get upgrade -y and a reboot.

There are no plugins installed.

It was set up to back up to S3 (my other sites aren’t), which may be relevant.

16287:C 28 May 02:33:33.481 * DB saved on disk
16287:C 28 May 02:33:33.482 * RDB: 18 MB of memory used by copy-on-write
155:M 28 May 02:33:33.532 * Background saving terminated with success
rake aborted!
Aws::S3::Errors::PermanentRedirect: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/seed-fu-2.3.5/lib/seed-fu/runner.rb:46:in `eval'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/aws-sdk-core-2.5.3/lib/aws-sdk-core/plugins/s3_sse_cpk.rb:19:in `call'

Here’s the full rebuild attempt:

https://gist.github.com/YesThatAllen/dde253dcc23d0f132cb8acbdcaea0a7e#file-launcher-rebuild-txt-L3617-L3630

My other three sites are fine, not sure what’s going on with this one, any pointers would be appreciated.

Just a wild guess: have you seen this topic below?

Thanks. I’d checked for this. However, there were no plugins installed.

Whilst this error is far from perfectly helpful, I’m guessing that AWS has deprecated (and now removed) whatever style of endpoint URL you’ve configured, and now everything is awful. Check your S3-related site settings and compare them against the S3 documentation for correctness.

The value is set in the UI, and the site is down, so I can’t get to it by normal means.

I suppose I can decompress a recent backup and edit the value in the pg dump. I’d have to know which key to empty out.

What are the odds that a change could be committed which lets launcher continue past this error?

The settings in question are

74	backup_frequency		3	1	2015-09-21 12:28:56.795021	2015-09-21 12:28:56.795021
75	s3_backup_bucket		1	backups-discuss	2016-05-18 14:53:39.422497	2016-05-18 14:53:39.422497
76	s3_disable_cleanup		5	t	2016-05-18 14:53:40.325988	2016-05-18 14:53:40.325988
77	s3_upload_bucket		1	backups-discuss	2016-05-18 15:10:17.459986	2016-07-25 21:50:42.532383
78	s3_secret_access_key	1	REDACTEDAi8w/JfYNGNt9KrYA5MEuoREDACTED	2016-05-18 15:10:18.783565	2016-05-18 15:10:18.783565
79	s3_access_key_id		1	REDACTEDGNC6OS3REDACTED	2016-05-18 15:10:19.543714	2016-05-18 15:10:19.543714
80	enable_s3_uploads		5	t	2016-05-18 15:10:27.6396	2016-05-18 15:10:27.6396

Any idea what’s wrong here?
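For anyone who ends up digging through a database dump the same way, here is a rough Python sketch that picks the S3-related rows out of lines like the ones above. The function name `s3_settings` is made up for illustration, and the column order (id, name, data type, value, created, updated) is an assumption inferred from the pasted rows, not something confirmed in this thread.

```python
from typing import Iterable, List, Tuple


def s3_settings(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (name, value) pairs for settings whose name mentions s3."""
    found = []
    for line in lines:
        # Split on tabs and drop the empty fields left by doubled tabs.
        # Limitation: a row with an empty value loses a field and is skipped.
        fields = [f for f in line.rstrip("\n").split("\t") if f]
        if len(fields) < 6:
            continue  # not a complete settings row
        # Assumed column order: id, name, data_type, value, created, updated.
        name, value = fields[1], fields[3]
        if "s3" in name:
            found.append((name, value))
    return found
```

It takes any iterable of text lines, so an already-open dump file can be passed in directly. It only lists the rows; deciding which value to change is still a manual step.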

We’ve restored the DO host to the day before I updated.

Here’s what the settings look like… what’s been entered incorrectly?

I restored the entire image so our site is up, but I have no reason to think I won’t hit the same failure on the next one-click update or ./launcher rebuild app.

I’m looking here:

I don’t see a smoking gun as to the problem. Any pointers would be appreciated.

Before trying an update again, I’ve been waiting for feedback on what’s set wrong here.

I can’t help but wonder if this error:

Aws::Errors::MissingCredentialsError: unable to sign request without credentials set

https://meta.discourse.org/t/upgrade-error-6-1-2017/63756/8

is related

Yes, I wonder what that is and how to fix it.

Why is a rebuild process trying to talk to our s3 backup endpoint at all?

Why is a failure to do so resulting in a failure to rebuild?

OK, I’ve spent enough time troubleshooting this to be certain:

If the s3 backup settings are incorrect, and enable_s3_backups is true, then a normal rebuild will fail.

I manage the site for a friend who entered his s3 keys incorrectly from day 1 and I’ve never had a problem rebuilding his container before.

An update to discourse/discourse_docker which skips trying to talk to s3, or at least gracefully accepts failure, would mean that people aren’t blindsided by an s3 backup failure during a rebuild
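To make "gracefully accepts failure" concrete, here is a minimal Python sketch of the idea. It is not how launcher or Discourse actually works; `Step` and `run_steps` are hypothetical names. Required steps still abort the run, while steps marked optional only produce a warning.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    optional: bool = False  # optional steps may fail without aborting


def run_steps(steps: List[Step]) -> List[str]:
    """Run steps in order; return warnings from optional steps that failed."""
    warnings = []
    for step in steps:
        try:
            step.action()
        except Exception as exc:
            if not step.optional:
                raise  # a required step failing still aborts the run
            warnings.append(f"{step.name} failed: {exc}")
    return warnings
```

With something like this, an S3 upload during bootstrap would be an optional step, and its failure would show up as a warning at the end of the rebuild instead of stopping it.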

The backtrace indicates that the rebuild attempted to upload an avatar to S3, because it was told that was where avatars are stored for this site. Chances are everyone’s avatar uploads have been exploding but nobody noticed, so in a way it’s nice you’ve been told it’s broken now, so it can be fixed… :troll:

I can’t wait to hear how.

I see, enable s3 uploads was, in fact, enabled, and the s3 setup was not correct.

The s3 config on this site was done a long time ago, and I haven’t run into this problem before.

I wonder what changed about the upload process lately, but more to the point:

Is there some agreement that if an upload to s3 fails during upgrade or bootstrap, the rebuild should be able to continue?

Maybe, but this might be a really rare problem for anyone to have?

It’ll be rarer still once discourse warns of s3 failures.

The problem is that once failed, I was left without a way to get the system up & running again… I had to roll back to a system level snapshot from a few days ago, so I could get around the inability to rebuild.

I think that is a major problem, from a design standpoint. In my books, a rebuild should only fail if something is screwed up on my box (i.e. something broken, or outdated), or I can’t reach git to get the newest version. If my box isn’t broken, then rebuild shouldn’t break it. If rebuild notices that I’ve messed up my S3 config then it should say “I can proceed if you want, but your uploads and backups are/will be missing. Do you want to abort or proceed, Meatbag?”
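The "abort or proceed" prompt described above could look roughly like this in Python. This is only a sketch of the behaviour being asked for, not anything launcher does today; `confirm_rebuild` and its parameters are hypothetical.

```python
from typing import Callable, List


def confirm_rebuild(problems: List[str], ask: Callable[[str], str] = input) -> bool:
    """Return True if the rebuild should go ahead."""
    if not problems:
        return True  # nothing to warn about
    message = (
        "Found configuration problems:\n"
        + "\n".join(f"  - {p}" for p in problems)
        + "\nProceed anyway? [y/N] "
    )
    # Anything other than an explicit yes counts as abort.
    return ask(message).strip().lower() in ("y", "yes")
```

Defaulting to abort keeps an unattended rebuild from silently continuing with a broken S3 config, while still giving the admin a way to bring the site back up.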

I think it’s really rare because it’s brand new in the past month.
