Backups not being automatically deleted (1.6.4 stable)


(ljpp) #1

Continuing earlier discussion as the issue persists.

So I have maximum local backups set to 3, but Discourse does not delete older backups. I have S3 offsite backups enabled and they work. I have the setting enabled not to delete anything from S3. Backups are done every night.

  • I cannot pinpoint exactly when this started, but the timeline points to early or mid August. It worked before. This may hint at the Discourse 1.5 -> 1.6.1 upgrade, which I performed around that time (11th of August).
  • I have tried redefining the number of backups to keep, and there have been a number of launcher rebuilds and reboots since then. The issue still persists.

I have limited sysadmin skillz, but if instructed I am happy to provide any logs or data that might help. The problem is quite severe: the backups are 0.25 GB each, so if I forget to clean them up manually, I will run out of disk space rather soon.
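
To put the disk-space concern in numbers, here is a rough sketch. The 0.25 GB backup size and the nightly schedule come from the post above; the 10 GB of free disk space is a made-up figure for illustration only, so substitute the real value.

```ruby
# Rough estimate of how many days pass before unpruned backups fill the disk.
BACKUP_SIZE_GB  = 0.25  # from the post above
BACKUPS_PER_DAY = 1     # nightly schedule, from the post above
FREE_DISK_GB    = 10.0  # hypothetical; replace with the actual free space

def days_until_full(free_gb, backup_gb, per_day)
  (free_gb / (backup_gb * per_day)).floor
end

puts days_until_full(FREE_DISK_GB, BACKUP_SIZE_GB, BACKUPS_PER_DAY)
```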

Any ideas?


(Wes Osborn) #2

Did you recently change your sitename?


(ljpp) #3

No, the site name has not changed. But thanks for giving me this idea - I store backups on the S3 for 90 days, and based on your comment I decided to have a look at the file naming.

The file naming convention changed on August 15th, i.e. with the 1.5 -> 1.6.1 upgrade, which matches my estimate of when this issue surfaced. :bug: There are two backups on the 14th, the last one issued manually just before the upgrade process. The next automated backup, on v1.6.1, has a different file naming scheme.

What puzzles me is that I am not seeing this issue on my other Discourse instance, a much smaller one, which has also gone through the 1.5->1.6 update cycle.

Any other ideas or suggestions?


(ljpp) #4

Forgot about this for a while and had gigabytes of backups. Any ideas how to debug this very nasty issue? @sam?


(Jeff Atwood) #5

@tgxworld can have a look. Did you change the site name or URL at all?


(Alan Tan) #6

I can’t reproduce this locally or on my production site.

@ljpp Can you take a backup and look for [2016-10-18 04:01:45] Removing old backups... in your logs? If that line is present and the older backups are still not deleted, you’ll need to do a bit more debugging:

./launcher enter app
rails c

# I need the output of the following commands
Backup.all.size
SiteSetting.maximum_backups
Backup.all.map(&:filename) # This should match all your existing backups

(ljpp) #7

Thanks for the attention. @codinghorror, no site name changes.

I did a manual backup and noticed that this issue is isolated to scheduled backups. I have the number of backups set to 3, and the number of local backup archives stays at 3 when I start the backups manually.

[2016-10-18 06:17:07] Removing old backups...

The outputs:

[1] pry(main)> Backup.all.size
=> 3
[2] pry(main)> SiteSetting.maximum_backups
=> 3
[3] pry(main)> Backup.all.map(&:filename)
=> ["tappara-co-2016-10-18-061440-v20160727233044.tar.gz",
 "tappara-co-2016-10-18-060845-v20160727233044.tar.gz",
 "tappara-co-2016-10-18-060840-v20160727233044.tar.gz"]

All these look correct.

And no, this is not easily reproduced. I run 3 Discourse instances, and only one has this issue.
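
Since the naming scheme came up earlier in the thread, here is a minimal sketch for splitting these filenames into their parts. The layout (site name, date, time, then a "v" followed by 14 digits) is inferred only from the three filenames listed above; `parse_backup_name` is a hypothetical helper and not part of Discourse, and the meaning of the trailing "v" segment is not confirmed here.

```ruby
# Assumed layout: <site>-<YYYY-MM-DD>-<HHMMSS>-v<14 digits>.tar.gz
BACKUP_NAME = /\A(?<site>.+)-(?<date>\d{4}-\d{2}-\d{2})-(?<time>\d{6})-v(?<version>\d{14})\.tar\.gz\z/

# Returns a hash of the filename parts, or nil if the name does not fit the assumed layout.
def parse_backup_name(filename)
  m = BACKUP_NAME.match(filename)
  return nil unless m

  { site: m[:site], date: m[:date], time: m[:time], version: m[:version] }
end

p parse_backup_name("tappara-co-2016-10-18-061440-v20160727233044.tar.gz")
```

A name that fails to parse would be a hint that it was written under a different naming scheme.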


(Alan Tan) #8

I can’t reproduce it even with scheduled backups.

Can you try triggering the scheduled backup manually?

./launcher enter app
rails c

# I need the output of the following command
Jobs.enqueue_in(1, :create_backup)

You should end up with 3 backups on disk.


(ljpp) #9

Alright, so the output is:

Jobs.enqueue_in(1, :create_backup)
=> "a6060bbe4f911cde5be0ba9b"

It created the 4th backup, but then deleted the oldest. So total count stayed at 3. Now I am mind boggled.


(Jeff Atwood) #10

So we can’t repro this, then?


(ljpp) #11

@codinghorror @tgxworld

Interestingly, last night’s auto-backup and the subsequent backup file deletion succeeded. Total count is still 3, as expected. I have not altered any settings, nor have I done anything to the server backend. I have no idea what made the backup deletion work again.

Please keep this open, but let’s halt the debugging effort for a while. I’ll let Discourse do its thing and observe whether the backup count starts increasing again. Honestly, ever since I raised the issue I have had to delete them manually, and now that the debugging has started, it’s the first time it works as expected. Impressive demo effect.


(Alan Tan) #12

I’ll close this first. If it happens again and you can reproduce it consistently, flag this topic and I’ll reopen :slight_smile:


(Alan Tan) #13

(Rafael dos Santos Silva) #14

(ljpp) #15

It’s happening again, like in Twin Peaks.

tappara-co-2016-10-20-010923-v20160727233044.tar.gz
tappara-co-2016-10-19-010747-v20160727233044.tar.gz
tappara-co-2016-10-18-190356-v20160727233044.tar.gz
tappara-co-2016-10-18-063657-v20160727233044.tar.gz

For some unknown reason, the number of backups has again increased to 4 after last night’s auto-backup, so Discourse has failed to delete one.
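
For reference, this is a minimal sketch of the pruning that would be expected with a limit of 3, applied to the four filenames above. It is not Discourse's actual implementation; `backups_to_delete` is a hypothetical helper, and it assumes all files share the same site prefix so that a plain string sort on the zero-padded timestamp is chronological.

```ruby
MAXIMUM_BACKUPS = 3 # the limit reported in this thread

# Returns the filenames that exceed the limit, oldest first.
def backups_to_delete(filenames, max)
  sorted = filenames.sort # zero-padded timestamps sort oldest first
  excess = sorted.size - max
  excess > 0 ? sorted.first(excess) : []
end

files = [
  "tappara-co-2016-10-20-010923-v20160727233044.tar.gz",
  "tappara-co-2016-10-19-010747-v20160727233044.tar.gz",
  "tappara-co-2016-10-18-190356-v20160727233044.tar.gz",
  "tappara-co-2016-10-18-063657-v20160727233044.tar.gz"
]

puts backups_to_delete(files, MAXIMUM_BACKUPS)
```

With these inputs, the one file over the limit is the 2016-10-18-063657 backup, which is the one that should have been removed.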


(Alan Tan) #16

Any errors in the logs?


(ljpp) #17

In which logs, to be more exact?

The count increased by one again last night. I did a manual backup, and this time Discourse removed 2 backups, so the count is back to 3.


(Sam) #18

This has started happening on my instance as well, I just noticed last night that my instance had 6 backups stored locally despite a maximum setting of 3.

I don’t see anything failing in sidekiq or any errors in /log (don’t know if it would show up in either place anyway), and manual deletion of the backups is working.

My s3 bucket is empty now as well, so it looks like it might not be uploading backups to s3 correctly now either? Or maybe it was emptied out when I deleted the older backups manually, I don’t really know much about how that process works though, so just tossing that out there.

I stay more or less on the latest version of Discourse, and have not changed anything substantial on my site since this started happening (October 23rd was the oldest backup being erroneously stored, so around then I suppose).


(Alan Tan) #19

Did you get the PM from the system user with the logs? Try taking a manual backup and see whether there are any errors.


(Sam) #20

Is there supposed to be a PM for automatic backups? Never mind, it looks like our system user has been sending those PMs only to itself, odd. The last PM it sent was 3 days ago, for the October 25th backup.

Let me try a manual backup and see what happens.

Edit:

It started and ran through fine, but it’s been spinning at “Executing the after_create_hook for the backup” for a couple minutes now. I’m not sure how long this normally takes though, I haven’t run a manual backup in quite a while.

Edit 2:

The “Executing the after_create_hook for the backup” thing still hasn’t finished yet. The backup itself was successfully created (about ten minutes ago now), but the process has yet to wrap up for some reason.

Edit 3:

Cancelled it at 15 minutes. “Executing the after_create_hook” never finished, but no errors were generated either. The backup was created successfully. Nothing was uploaded to s3 and the system user has not sent a pm about the backup.