Backups are duplicating and not respecting number to keep on disk

OK, upgrading it now. It takes a while to cause a problem, so we’ll update if it happens again. Thank you!

3 Likes

We continue to have problems with backups repeatedly failing and then failed backups occupying all of the available disk space. The duplicate issue was addressed and we currently have Discourse set to only take 1 backup every other day.

@gerhard I believe that @Wingtip did a rebuild about 4 days ago, so it should have incorporated the changes you made. Any ideas?

I’m out of ideas. Do you have any indication why it fails? Is there anything in the logs?

3 Likes

To clarify, the backups are being deleted off disk, so we don’t have that problem any more. There are two issues currently:

  1. I had backups scheduled for 3:45 UTC in the admin UI, but they seem to run all the time. It’s unclear what’s going on here. The host timezone is ET, but inside the docker container it is UTC and the time is correct.

I thought they were running at 3:45 PM, since the process had started at 3:47 PM UTC when I last checked, but nope. When the backup and gzip finished, it immediately seemed to start taking another backup. Why? Dunno. Wish I did.

  2. Since backups run at weird times and/or potentially constantly, this blocks sidekiq and users complain about not getting notifications. To fix it I log in to the server and kill the pg_dump and gzip processes, which frees up sidekiq and everything is copacetic again, except of course we have no backup.
2 Likes

Apologies for mischaracterizing the problem.

2 Likes

The backup_time_of_day setting uses the 24-hour clock format, so a value of 3:45 is AM.

Scheduling of backups works like this:

  • The ScheduleBackup job runs every day at midnight. It looks at the “last modified” date of the latest backup file and schedules a new backup when it is older than backup_frequency days.

  • The next backup doesn’t run exactly at backup_time_of_day. Discourse adds up to 10 minutes (it’s a random value) to the configured time, roughly as sketched below.
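As a rough illustration of that scheduling logic, here is a minimal sketch in Python; it is not Discourse’s actual Ruby job, and the setting values are just examples:

import random
from datetime import datetime, timedelta, timezone

BACKUP_FREQUENCY_DAYS = 2      # site setting backup_frequency
BACKUP_TIME_OF_DAY = "03:45"   # site setting backup_time_of_day (24-hour clock)

def schedule_backup(last_backup_modified):
    """Runs once a day at midnight UTC; returns when the next backup should start, or None."""
    now = datetime.now(timezone.utc)
    if now - last_backup_modified < timedelta(days=BACKUP_FREQUENCY_DAYS):
        return None  # the latest backup is recent enough, nothing to schedule
    hour, minute = map(int, BACKUP_TIME_OF_DAY.split(":"))
    scheduled = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    # Discourse adds a random offset of up to 10 minutes to the configured time
    return scheduled + timedelta(minutes=random.randint(0, 10))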

I’m not sure why this process would fail in your case and lead to constantly running backups. The one thing I noticed is that sidekiq is paused longer than needed. Currently it stays paused until the backup is successfully uploaded to S3 or until a failure occurs. I’m going to change that.

How long does it take for Discourse to upload a backup file to S3? You should be able to see that in the backup log when you manually create a backup. Is there anything that might affect the “last modified” timestamp of the backup file on S3 which might confuse the system?
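If it helps, here is a quick way to see the timestamps S3 itself reports for those files, sketched with boto3 (the bucket name and prefix are placeholders, substitute your own):

import boto3

s3 = boto3.client("s3")
# Placeholder bucket and prefix; substitute your backup bucket and path
resp = s3.list_objects_v2(Bucket="my-discourse-backups", Prefix="default/")
for obj in sorted(resp.get("Contents", []), key=lambda o: o["LastModified"]):
    print(obj["LastModified"], obj["Size"], obj["Key"])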

3 Likes

Yes, it would have been pretty crazy for it to default to PM.

Re S3 timestamps, will leave that to @Clay.

I’m manually taking a backup now.

Edit: OK, backup completed, total runtime 44 minutes.

19:34:30  started backup
19:43:24  DB dump done       (9 min)
19:47:17  tar done           (4 min)
20:10:02  gzip done          (23 min)
20:18:25  upload to S3 done  (8 min)

On a side note, gzipping the tarball only reduced its size by about 1%, since the DB dump is already gzipped and the uploads are all compressed images. You guys may want to consider skipping the gzip step, as it offers very little benefit. Or, if other users have uncompressed objects in their backups, perhaps make it a switch to disable gzip.

backup tar     33,399,429,120 bytes
backup tar.gz  33,037,127,638 bytes
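For reference, here is what the saving from those two sizes works out to:

before, after = 33_399_429_120, 33_037_127_638
print(f"{(before - after) / before:.1%}")  # prints 1.1%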

And another note: it looks like temporary DB backups aren’t being properly cleaned up either. It’s unclear whether this only happens when the backup fails, but they’re definitely in there; we have old DB dumps dating back to Jan 19. Oddly, the sizes vary widely, from 143 MB to 5.1 GB, which is why I hypothesized they’re failed backups.

root@forum-app:/var/www/discourse/tmp/backups/default# du -sh
12G .
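In the meantime, something like this lists (or, with the last line uncommented, deletes) the stale dumps in that directory; the path is from the du output above and the 7-day cutoff is arbitrary:

import time
from pathlib import Path

TMP_BACKUPS = Path("/var/www/discourse/tmp/backups/default")
cutoff = time.time() - 7 * 24 * 3600  # flag anything older than 7 days

for f in sorted(TMP_BACKUPS.rglob("*")):
    if f.is_file() and f.stat().st_mtime < cutoff:
        print(f"stale: {f} ({f.stat().st_size / 1e9:.1f} GB)")
        # f.unlink()  # uncomment to actually delete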

3 Likes

We had this problem in the past, @gerhard, so this should be addressed. Before, the DB was not compressed independently, so there was value in the gzip step when combined with the images, even if the images compressed poorly. But if the DB is already compressed, there’s very little value in compressing it again along with the images.

If this double-compression step can’t be avoided, the compression level should be set to “almost none” so it can go as fast as possible.
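For illustration, this is the kind of “almost none” setting I mean; a Python sketch, not how Discourse actually builds its archive, and the file names are made up:

import tarfile

# compresslevel=1 trades a slightly larger file for much faster compression,
# which is sensible when the contents (a gzipped pg_dump, JPEGs) are already compressed
with tarfile.open("backup.tar.gz", "w:gz", compresslevel=1) as tar:
    tar.add("dump.sql.gz")
    tar.add("uploads/")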

Also, @Wingtip, that looks like a pretty big increase in backup size to my eye. How much growth is there in backup size over time?

2 Likes

My guess is that the upload-to-S3 step is taking a very long time.

Thanks for the detailed information. I don’t see anything out of the ordinary. I’ll definitely take a look at optimizing the gzipping, but I don’t have a clue what’s causing your backup issues. Please get back to me if you find a way to reproduce the problem.

No, it’s taking only 8 minutes.

3 Likes

Ahh that makes sense, the DB backup wasn’t compressed before.

Re increase in size, our list only goes back to Jan 24 because we switched S3 bucket names due to your SSH port lockout issue. @Clay?

Re S3 upload time, that was in my post: it took 8 minutes to upload all 33 GB.

@gerhard: Right, it is not reproducible with a manual backup, which makes it more challenging to diagnose. When I had to kill the backups to free up sidekiq today and 4 days ago, they seemed to have both a pg_dump and a gzip running. I killed both processes.

3 Likes

This is correct. The older backups are sitting in the old bucket, which I plan to delete once we get this issue sorted.

So looking back in the old bucket, we were seeing moderate increases in size (1 GB/month or so) until mid-January. On 12 Jan 2019, the backup size was 21.6 GB. On 13 Jan, the backup size was 28.8 GB. I know of no site-content reason for that jump. It then was up to 29.9 GB on 15 Jan and now is coming in at 30.8 GB. Any clue what would have happened mid-Jan to cause the backup sizes to balloon?

Here’s a screenshot of the new bucket:

Here’s a screenshot of the old bucket. Ignore the red lines – those were the duplicate backups from the mystery source that no longer are an issue.

1 Like

A recent update rebaked images to 2x for retina screens; could that have been it?

3 Likes

That would be my guess as well.

4 Likes

Yeah, that change would effectively double storage requirements for all images, which is a ton of space.

I would actually prefer to only store 2x images and downscale them as needed, but that does have potential downsides: for example, a 1-pixel border at 2x might not be visible at all at 1x. That’s a big deal for the UI, so you do need both 1x and 2x for all interface elements, but the tradeoff would make sense for people posting cat pics.

3 Likes

Not a bad idea, but on-the-fly server-side downscaling might be somewhat expensive in CPU time. And it adds steps and complexity to every post render.

2 Likes

Can’t you just set the dimensions in pixels and have the user’s browser do it?

The whole point of having srcset, though, is that non-retina clients still get to download less data. This would be at odds with that.

Instead… I think a far better suggestion here is to have a “backup” mode that completely skips all optimized images.

@Wingtip can you do a quick breakdown of the folder sizes in your upload area? How big is the original folder, and how big is the optimized images one?

We can always (expensively) rebuild all optimized images, which is perfectly reasonable for a total disaster recovery scenario.

4 Likes

Is it less data? You’re simply upscaling existing images in the first place, right? They’re lossy compressed, and there’s no way to actually add information, so I’m not sure why these previously-upscaled 2x images would be larger (in bytes) than 1x.

Skipping 2x images in the backup would work too, sure.

Under the optimized uploads directory, 1X is 100MB, 2X is 4.2GB, and 3X is 9.4GB.

Under the original directory, 1X is 65MB, 2X is 3.8GB, and 3X is 9.0GB.

Why you’re generating 3X images at all is beyond me; honestly, the human eye can’t tell any difference. Although the original dir also has 2X and 3X, so I’m probably missing something here; maybe you categorize them by size or something upon upload.

These directory names have nothing to do with what’s inside. They’re just there to ensure we don’t blow up the filesystem with a single directory containing millions of files.

7 Likes