This has been going on for a long time—I have a monthly reminder in my calendar to delete these leftover files. Backup logs are empty: “No logs yet…” and nothing in error logs points to problems with Amazon S3.
Discourse is updated regularly and is currently at 2.9.0.beta14.
This is a standard install, right? Is there a chance that the OS (or something else) is killing the backup process during the upload? Because even when there’s a backup failure, the local file should be deleted at the end of the process.
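One rough way to check whether the kernel's out-of-memory killer took out a process around backup time is to filter the host's kernel log for "Killed process" lines. This is only a sketch: the function name `find_oom_kills` is made up for illustration, and the usage line assumes a systemd host where `journalctl -k` shows kernel messages.

```shell
# Hypothetical helper: print kernel log lines that report a killed process.
# Reads log text on stdin so it works with journalctl, dmesg, or a saved log file.
find_oom_kills() {
  # Matches both "Out of memory: Killed process 123 (name)" and the older
  # two-line form whose second line starts with "Killed process 123 (name)".
  # "|| true" keeps the exit status at 0 when nothing matches.
  grep -E 'Killed process [0-9]+' || true
}

# Example usage on a systemd host (assumption):
#   journalctl -k --no-pager | find_oom_kills
```

If this prints lines naming ruby, pg_dump, or similar processes at the nightly backup time, memory pressure would be a plausible suspect. No output does not rule out other causes.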
Production logs only go back a week, so the older “undeleted” backups fall out of that range, but I’ll keep an eye on the future ones. The only entry mentioning “backup” was this one in the 11/30 log:
Started GET "/.env.backup" for 3.236.147.46 at 2022-11-29 19:15:57 +0000
ActionController::RoutingError (No route matches [GET] "/.env.backup")
I see a new undeleted backup in /var/discourse/shared/standalone/backups/default but nothing in the production.log. Nothing in the production_errors.log either. Where else could I look?
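For spotting leftovers without a calendar reminder, a small helper like the one below could list local backup archives older than a chosen age. It is a sketch under assumptions: the default directory is the one mentioned in this thread, while the function name `list_stale_backups` and the one-day threshold are illustrative choices. It assumes GNU `find` and archives ending in `.tar.gz` or `.sql.gz`. It only lists files and deletes nothing.

```shell
# Hypothetical helper: list local backup archives older than N minutes.
# $1 = backup directory (default: the path from this thread)
# $2 = minimum age in minutes (default: 1440, i.e. one day; an arbitrary choice)
list_stale_backups() {
  dir="${1:-/var/discourse/shared/standalone/backups/default}"
  min_age_minutes="${2:-1440}"
  # Only look at regular files directly inside the directory.
  find "$dir" -maxdepth 1 -type f \
    \( -name '*.tar.gz' -o -name '*.sql.gz' \) \
    -mmin "+${min_age_minutes}" -print | sort
}

# Example usage:
#   list_stale_backups
#   list_stale_backups /var/discourse/shared/standalone/backups/default 720
```

Comparing the modification times of the listed files with the nightly backup schedule may also show whether the leftovers cluster on particular nights.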
P.S. I ran a backup from the CLI and the backup was successfully removed - I’ll try that a few more times to see if I can get an error there.
I haven’t been able to reproduce the undeleted local backup via the CLI, but it keeps happening once or twice a week during the nightly backup. I also don’t see any of the backup log output in production.log. Are you sure that’s where it’s written, @pfaffman ?
I think it should be. When I had a similar problem with some other S3 service, I was unable to find errors in either Discourse or their service. And I gave up and switched to something different. But you’re using AWS, S3, the Real Deal, so I’m quite surprised.
I’ve tried looking like this: grep -r "Output file is stored on S3" /var/discourse
as that phrase is the last line of the CLI backup output, but nothing is found.
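One possible gap in that search: if older logs have been rotated and gzip-compressed, a plain grep will not match text inside them. The sketch below searches both plain and compressed files for a fixed phrase. The function name `search_logs` is made up for illustration, and whether the backup output is written to any log file at all remains an open question here.

```shell
# Hypothetical helper: print the names of files under a directory that contain
# a fixed phrase, looking inside gzip-compressed files as well.
# $1 = phrase to search for (fixed string, not a regex)
# $2 = directory to search recursively
search_logs() {
  pattern="$1"
  dir="$2"
  find "$dir" -type f | sort | while IFS= read -r f; do
    # "gzip -cdf" decompresses gzip files and passes other files through unchanged.
    if gzip -cdf -- "$f" 2>/dev/null | grep -F -q -- "$pattern"; then
      printf '%s\n' "$f"
    fi
  done
}

# Example usage:
#   search_logs "Output file is stored on S3" /var/discourse/shared/standalone/log
```

Pointing it at the log directory rather than all of /var/discourse keeps it from reading through uploads and backup archives.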
Any chance that the server reboots due to automatic updates of the host OS? They might happen while the upload to S3 is in progress. Is there anything in the logs of your OS? Maybe reset the backup_time_of_day site setting to the default value or a different time and see if the problem disappears.
No, current uptime is 36 days. I had suspected that the DigitalOcean droplet backup running concurrently might have been the cause, but that happens once a week and my undeleted backups occur more frequently than that.
I’ll try a different backup_time_of_day. It was set to 2:00 UTC, so we’ll see if the default 3:30 UTC makes any difference.
OOOOH! That’s a good one. That would explain it. I bet that’s it. And the middle of the night is a Good Time for both backups and reboots. It doesn’t quite explain why the problem went away when I changed to a different service, but maybe my luck just changed, or whatever I changed to was faster or something.