Backups keep taking down my forum

I keep going through cycles where backup creation takes my site down due to disk space issues.

I have it set up so that backups and uploads are stored on a separate DO volume. The volume is 300GB.

My backup settings:

The problem has been recurring for months. I get messages in the admin inbox about the backup failing (see below).

No space left on device
[2024-02-14 03:43:34] Finalizing backup...
[2024-02-14 03:43:34] Creating archive: elite-fourum-2024-02-14-033845-v20240204204532.tar.gz
[2024-02-14 03:43:34] Making sure archive does not already exist...
[2024-02-14 03:43:34] Creating empty archive...
[2024-02-14 03:43:34] Archiving data dump...
[2024-02-14 03:43:58] Archiving uploads...
[2024-02-14 03:55:03] Removing tmp '/var/www/discourse/tmp/backups/default/2024-02-14-033845' directory...
[2024-02-14 03:55:03] Gzipping archive, this may take a while...
[2024-02-14 04:25:38] EXCEPTION: /var/www/discourse/lib/discourse.rb:138:in `exec': Failed to gzip archive.

gzip: /var/www/discourse/public/backups/default/elite-fourum-2024-02-14-033845-v20240204204532.tar.gz: No space left on device

[2024-02-14 04:25:38] /var/www/discourse/lib/discourse.rb:172:in `execute_command'
/var/www/discourse/lib/discourse.rb:138:in `exec'
/var/www/discourse/lib/discourse.rb:34:in `execute_command'
/var/www/discourse/lib/backup_restore/backuper.rb:253:in `create_archive'
/var/www/discourse/lib/backup_restore/backuper.rb:40:in `run'
/var/www/discourse/lib/backup_restore.rb:13:in `backup!'
/var/www/discourse/app/jobs/regular/create_backup.rb:10:in `execute'
/var/www/discourse/app/jobs/base.rb:297:in `block (2 levels) in perform'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rails_multisite-5.0.0/lib/rails_multisite/connection_management.rb:82:in `with_connection'
/var/www/discourse/app/jobs/base.rb:284:in `block in perform'
/var/www/discourse/app/jobs/base.rb:280:in `each'
/var/www/discourse/app/jobs/base.rb:280:in `perform'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:202:in `execute_job'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:170:in `block (2 levels) in process'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/middleware/chain.rb:177:in `block in invoke'
/var/www/discourse/lib/sidekiq/pausable.rb:132:in `call'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/middleware/chain.rb:179:in `block in invoke'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/middleware/chain.rb:182:in `invoke'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:169:in `block in process'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:136:in `block (6 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/job_retry.rb:113:in `local'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:135:in `block (5 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq.rb:44:in `block in <module:Sidekiq>'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:131:in `block (4 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:263:in `stats'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:126:in `block (3 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/job_logger.rb:13:in `call'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:125:in `block (2 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/job_retry.rb:80:in `global'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:124:in `block in dispatch'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/job_logger.rb:39:in `prepare'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:123:in `dispatch'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:168:in `process'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:78:in `process_one'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:68:in `run'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/component.rb:8:in `watchdog'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/component.rb:17:in `block in safe_thread'
[2024-02-14 04:25:38] Deleting old backups...
[2024-02-14 04:25:38] Cleaning stuff up...
[2024-02-14 04:25:38] Removing '.tar' leftovers...
[2024-02-14 04:25:39] Marking backup as finished...
[2024-02-14 04:25:39] Notifying 'system' of the end of the backup...

What then happens is that I end up with a series of partially(?) complete backups that I’m not notified about, as far as I know. They accumulate to the point where each new daily backup attempt fills the disk and takes my site down at the same time every day.

I have to manually ssh in and remove these because they don’t show up on the /admin/backups page.

I know it’s hard to replicate this issue and therefore hard to fix, but I was wondering if there was something obvious I am doing wrong or if others face the same issue?

The first thing to try, after removing the bits which are not compressed backup files, is to reduce the maximum number of backups to 2.
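If it helps, here is a rough sketch of the manual clean-up, assuming a standard Docker install with the backup directory on the big volume (adjust the path to wherever your backups actually live):

# see what is taking up the space
du -sh /var/discourse/shared/standalone/backups/default/*
# completed backups end in .tar.gz; bare .tar files are fragments from failed runs
ls -lh /var/discourse/shared/standalone/backups/default/*.tar
rm /var/discourse/shared/standalone/backups/default/*.tar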

I think in any case it’s best to have some way to fetch these backups to somewhere else - your home for example. If you had a problem with your hoster and they deleted your account, you’d presently be left with nothing. Similarly if your account was compromised, and perhaps also if they had a fire.

Once you have some way - perhaps manual, perhaps automatic - to get offsite copies, you will be very close to having a way to check for and delete fragments.
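As a hedged example of the manual route, assuming you have SSH access from home and the standard Docker install layout (again, adjust the path to your setup):

# pull completed backups down to a home machine; only .tar.gz files are complete
rsync -av --include='*.tar.gz' --exclude='*' \
  user@your-forum-host:/var/discourse/shared/standalone/backups/default/ \
  ~/forum-backups/

Run that from a scheduled job on the home machine and you have the automatic variant.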

I’ve suggested before that the dashboard should warn if the backup files have not been read for several days since their creation. That’s a relatively easy check, and in my view a good proxy for checking there’s an offsite copy.
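A sketch of what that check could look like from a shell, assuming GNU find and a filesystem that records access times (relatime is enough; noatime would defeat it):

# list backups whose access time is not newer than their modification time,
# i.e. files that apparently haven't been read since they were written
find /var/discourse/shared/standalone/backups/default -name '*.tar.gz' \
  -printf '%A@ %T@ %p\n' | awk '$1 <= $2 {print $3}'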

You can also choose to put your backups into block storage, and you could do that using a different provider. Then you’d be less likely to lose both your installation and your backups.

I think there’s long-pending work which would avoid needing both the uncompressed backup and, briefly, the compressed archive on disk at the same time, but it’s not worth waiting for that. In the meantime, you need space for the N backups you are retaining, plus 1 for the backup being made in uncompressed form, plus 1 for the compressed archive, which exists briefly before the oldest of the N is deleted.

You need disk space for N+2 backups, and if a backup fails you need to delete the bits.
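As a worked example with made-up numbers: if each backup is roughly 40GB and you retain 5, peak usage is about (5 + 2) × 40GB = 280GB, which is uncomfortably close to a 300GB volume. Dropping retention to 2 brings the peak down to about (2 + 2) × 40GB = 160GB.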

Make sure that you also put that temporary directory (/var/www/discourse/tmp) on your 300GB partition. That’s the one that’s filling the disk.

You could also consider moving uploads to that partition.

Do you know off-hand how to do that? Is there a yml setting or something I need to change?

I also have it set up to show a static offline screen when rebuilding, so I don’t know if that complicates things.

Something like

# in containers/app.yml, alongside any existing volume entries
volumes:
  - volume:
      host: /your/big/partition/tmp   # a directory on the big partition
      guest: /var/www/discourse/tmp   # where the backup is staged before compression

Presumably you’re doing something like that already to get the backups on the big partition?

It does. It’s probably not the problem, unless the problem is that it keeps showing the static offline page even though Discourse is up.

I found out after making this topic that you need to run a command on the console when you expand your DigitalOcean volume. So effectively I was not using all of my 300GB.
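For anyone else who hits this: DigitalOcean’s volume resize guide has you grow the filesystem after expanding the volume. A rough sketch, assuming the volume is formatted ext4 and using an example device name:

# confirm the volume device and its current size
lsblk
# grow an ext4 filesystem to fill the expanded volume (example device name)
sudo resize2fs /dev/disk/by-id/scsi-0DO_Volume_your-volume-name
# for an XFS-formatted volume it would be xfs_growfs on the mount point instead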

I fixed that and changed nothing else, and the problem recurred today. There were 2 unzipped tar files and 3 gzipped ones when my site went down.

I will try the strategy discussed above next.

But what I wanted to say is that it would be nice to have an indicator in the admin UI that there are failed backups. Or maybe clear out any *.tar files when triggering a new backup process? In this case, I had 90GB of unzipped backups that can’t be seen from the admin UI, and I also got no “backup failed” DM for either of the failed runs.
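In the meantime I may just schedule something like this from the host’s crontab shortly before the backup window (the path is an assumption for a standard Docker install):

# delete uncompressed .tar fragments older than 12 hours;
# completed .tar.gz backups are left alone
find /var/discourse/shared/standalone/backups/default -name '*.tar' -mmin +720 -delete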

How’s the memory usage on your droplet? The backup process should run clean-up routines and send a notification to admins when it fails. That won’t happen if the process gets terminated by the out-of-memory killer.
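A quick way to check, assuming a systemd host (nothing Discourse-specific):

# look for OOM-killer activity in the kernel log around the backup window
journalctl -k | grep -i 'out of memory'
# or via dmesg if the journal doesn't go back far enough
dmesg -T | grep -i 'killed process'
# current memory and swap headroom
free -h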

Maybe that’s what’s happening! I’ve seen this “interrupted backups leave partial backups that fill the disk” scenario on a few sites. My best explanation has been an OS reboot in the middle of a backup, but I’ve seen it where there are no OS reboots…

The backup process getting terminated by the OOM killer seems like a plausible cause, and it’s hard enough to replicate that it would explain why this has been so elusive.

. . . .

Oh. Darn. One site that I remembered having this problem has 16GB of RAM, so I don’t think that explains it. On that site the issue is that every week or so a backup is left on the local disk after it gets (or maybe does not get) pushed to S3. They also have over 100GB of free disk space, so it takes months for the issue to become a big enough problem that the disk gets full.

Here’s the set of files I just deleted:

forum-2024-03-11-123904-v20240202052058.tar.gz
forum-2024-03-09-123159-v20240202052058.tar.gz                           
forum-2024-03-07-123727-v20240202052058.tar.gz                           
forum-2024-03-05-123019-v20240202052058.tar.gz
forum-2024-03-03-123934-v20240202052058.tar.gz  

+1 to that, the forum I help run randomly has backups left on the server instead of pushed to S3, and it’s brought the forum down at least once.

Not sure if this is helpful, but here are the metrics from DO (screenshots: 7-day view, 24-hour view, and a zoomed-in view).