Backups keep taking down my forum

I keep going through cycles where backup creation takes my site down due to disk space issues.

I have it set up so that backups and uploads are stored on a separate DO volume. The volume is 300GB.

My backup settings:

The problem has been recurring for months. I get messages in the admin inbox about the backup failing (see below).

No space left on device
[2024-02-14 03:43:34] Finalizing backup...
[2024-02-14 03:43:34] Creating archive: elite-fourum-2024-02-14-033845-v20240204204532.tar.gz
[2024-02-14 03:43:34] Making sure archive does not already exist...
[2024-02-14 03:43:34] Creating empty archive...
[2024-02-14 03:43:34] Archiving data dump...
[2024-02-14 03:43:58] Archiving uploads...
[2024-02-14 03:55:03] Removing tmp '/var/www/discourse/tmp/backups/default/2024-02-14-033845' directory...
[2024-02-14 03:55:03] Gzipping archive, this may take a while...
[2024-02-14 04:25:38] EXCEPTION: /var/www/discourse/lib/discourse.rb:138:in `exec': Failed to gzip archive.

gzip: /var/www/discourse/public/backups/default/elite-fourum-2024-02-14-033845-v20240204204532.tar.gz: No space left on device

[2024-02-14 04:25:38] /var/www/discourse/lib/discourse.rb:172:in `execute_command'
/var/www/discourse/lib/discourse.rb:138:in `exec'
/var/www/discourse/lib/discourse.rb:34:in `execute_command'
/var/www/discourse/lib/backup_restore/backuper.rb:253:in `create_archive'
/var/www/discourse/lib/backup_restore/backuper.rb:40:in `run'
/var/www/discourse/lib/backup_restore.rb:13:in `backup!'
/var/www/discourse/app/jobs/regular/create_backup.rb:10:in `execute'
/var/www/discourse/app/jobs/base.rb:297:in `block (2 levels) in perform'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rails_multisite-5.0.0/lib/rails_multisite/connection_management.rb:82:in `with_connection'
/var/www/discourse/app/jobs/base.rb:284:in `block in perform'
/var/www/discourse/app/jobs/base.rb:280:in `each'
/var/www/discourse/app/jobs/base.rb:280:in `perform'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:202:in `execute_job'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:170:in `block (2 levels) in process'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/middleware/chain.rb:177:in `block in invoke'
/var/www/discourse/lib/sidekiq/pausable.rb:132:in `call'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/middleware/chain.rb:179:in `block in invoke'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/middleware/chain.rb:182:in `invoke'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:169:in `block in process'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:136:in `block (6 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/job_retry.rb:113:in `local'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:135:in `block (5 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq.rb:44:in `block in <module:Sidekiq>'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:131:in `block (4 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:263:in `stats'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:126:in `block (3 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/job_logger.rb:13:in `call'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:125:in `block (2 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/job_retry.rb:80:in `global'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:124:in `block in dispatch'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/job_logger.rb:39:in `prepare'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:123:in `dispatch'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:168:in `process'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:78:in `process_one'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/processor.rb:68:in `run'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/component.rb:8:in `watchdog'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/sidekiq-6.5.12/lib/sidekiq/component.rb:17:in `block in safe_thread'
[2024-02-14 04:25:38] Deleting old backups...
[2024-02-14 04:25:38] Cleaning stuff up...
[2024-02-14 04:25:38] Removing '.tar' leftovers...
[2024-02-14 04:25:39] Marking backup as finished...
[2024-02-14 04:25:39] Notifying 'system' of the end of the backup...

What then happens is that I end up with a series of partially(?) complete backups that I’m not notified about, as far as I know. They accumulate to the point where each new daily backup attempt fills the disk and takes my site down at the same time every day.

I have to manually ssh in and remove these because they don’t show up on the /admin/backups page.

I know it’s hard to replicate this issue and therefore hard to fix, but I was wondering if there was something obvious I am doing wrong or if others face the same issue?

The first thing to try, after removing the bits which are not compressed backup files, is to reduce the maximum number of backups to 2.
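If it helps, here is a rough sketch of the manual clean-up, assuming a standard Docker install with the backup directory on the big volume (adjust the path to wherever your backups actually live):

# see what is taking up the space
du -sh /var/discourse/shared/standalone/backups/default/*
# completed backups end in .tar.gz; bare .tar files are fragments from failed runs
ls -lh /var/discourse/shared/standalone/backups/default/*.tar
rm /var/discourse/shared/standalone/backups/default/*.tar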

I think in any case it’s best to have some way to fetch these backups to somewhere else - your home for example. If you had a problem with your hoster and they deleted your account, you’d presently be left with nothing. Similarly if your account was compromised, and perhaps also if they had a fire.

Once you have some way - perhaps manual, perhaps automatic - to get offsite copies, you will be very close to having a way to check for and delete fragments.
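As a hedged example of the manual route, assuming you have SSH access from home and the standard Docker install layout (again, adjust the path to your setup):

# pull completed backups down to a home machine; only .tar.gz files are complete
rsync -av --include='*.tar.gz' --exclude='*' \
  user@your-forum-host:/var/discourse/shared/standalone/backups/default/ \
  ~/forum-backups/

Run that from a scheduled job on the home machine and you have the automatic variant.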

I’ve suggested before that the dashboard should warn if the backup files have not been read for several days since their creation. That’s a relatively easy check, and in my view a good proxy for checking there’s an offsite copy.
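A sketch of what that check could look like from a shell, assuming GNU find and a filesystem that records access times (relatime is enough; noatime would defeat it):

# list backups whose access time is not newer than their modification time,
# i.e. files that apparently haven't been read since they were written
find /var/discourse/shared/standalone/backups/default -name '*.tar.gz' \
  -printf '%A@ %T@ %p\n' | awk '$1 <= $2 {print $3}'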

You can also choose to put your backups into block storage, and you could do that using a different provider. Then you’d be less likely to lose both your installation and your backups.

I think there’s long-pending work which would avoid needing both the uncompressed backup and, briefly, the compressed archive on disk at the same time, but it’s not worth waiting for that. In the meantime, you need space for the N backups you are retaining, plus 1 for the backup being made in uncompressed form, plus 1 for the compressed archive, which exists briefly before the oldest of the N is deleted.

You need disk space for N+2 backups, and if a backup fails you need to delete the bits.
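As a worked example with made-up numbers: if each backup is roughly 40GB and you retain 5, peak usage is about (5 + 2) × 40GB = 280GB, which is uncomfortably close to a 300GB volume. Dropping retention to 2 brings the peak down to about (2 + 2) × 40GB = 160GB.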

Make sure that you also put that temporary directory (/var/www/discourse/tmp) on your 300GB partition. That’s the one that’s filling the disk.

You could also consider moving uploads to that partition.

Do you know off-hand how to do that? Is there a yml setting or something I need to change?

I also have it set up to show a static offline screen when rebuilding, so I don’t know if that complicates things.

Something like

# in containers/app.yml, alongside any existing volume entries
volumes:
  - volume:
      host: /your/big/partition/tmp   # a directory on the big partition
      guest: /var/www/discourse/tmp   # where the backup is staged before compression

Presumably you’re doing something like that already to get the backups on the big partition?

It does. It’s probably not the problem, unless the problem is that it keeps showing the static offline page even though Discourse is up.

I found out after making this topic that you need to run a command on the console when you expand your DigitalOcean volume. So effectively I was not using all of my 300GB.
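For anyone else who hits this: DigitalOcean’s volume resize guide has you grow the filesystem after expanding the volume. A rough sketch, assuming the volume is formatted ext4 and using an example device name:

# confirm the volume device and its current size
lsblk
# grow an ext4 filesystem to fill the expanded volume (example device name)
sudo resize2fs /dev/disk/by-id/scsi-0DO_Volume_your-volume-name
# for an XFS-formatted volume it would be xfs_growfs on the mount point instead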

I fixed that and changed nothing else, and the problem recurred today. There were 2 unzipped tar files and 3 gzipped ones when my site went down.

I will try the strategy discussed above next.

But what I wanted to say is that it would be nice to have an indicator in the admin UI that there are failed backups. Or maybe clear out any *.tar files when triggering a new backup process? In this case, I had 90GB of unzipped backups that can’t be seen from the admin UI, and I also got no “backup failed” DM for either of the failed runs.
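In the meantime I may just schedule something like this from the host’s crontab shortly before the backup window (the path is an assumption for a standard Docker install):

# delete uncompressed .tar fragments older than 12 hours;
# completed .tar.gz backups are left alone
find /var/discourse/shared/standalone/backups/default -name '*.tar' -mmin +720 -delete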

How’s the memory usage on your droplet? The backup process should run clean-up routines and send a notification to admins when it fails. That won’t happen if the process gets terminated by the out-of-memory killer.
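A quick way to check, assuming a systemd host (nothing Discourse-specific):

# look for OOM-killer activity in the kernel log around the backup window
journalctl -k | grep -i 'out of memory'
# or via dmesg if the journal doesn't go back far enough
dmesg -T | grep -i 'killed process'
# current memory and swap headroom
free -h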

Maybe that’s what’s happening! I’ve seen this “interrupted backups leave partial backups that fill the disk” scenario on a few sites. My best explanation has been an OS reboot in the middle of a backup, but I’ve seen it where there are no OS reboots…

The backup process getting terminated by the OOM killer seems like a plausible cause, and it’s hard enough to replicate that it would explain why this has been so elusive.

. . . .

Oh. Darn. One site that I remembered having this problem has 16GB of RAM, so I don’t think that explains it. On that site the issue is that every week or so a backup is left on the local disk after it gets (or maybe does not get) pushed to S3. They also have over 100GB of free disk space, so it takes months for the issue to become a big enough problem that the disk gets full.

Here’s the set of files I just deleted:

forum-2024-03-11-123904-v20240202052058.tar.gz
forum-2024-03-09-123159-v20240202052058.tar.gz                           
forum-2024-03-07-123727-v20240202052058.tar.gz                           
forum-2024-03-05-123019-v20240202052058.tar.gz
forum-2024-03-03-123934-v20240202052058.tar.gz  

+1 to that, the forum I help run randomly has backups left on the server instead of pushed to S3, and it’s brought the forum down at least once.

Not sure if this is helpful, but here are the metrics from DO (screenshots: 7-day view, 24-hour view, and a zoomed-in view).