Backups are duplicating and not respecting number to keep on disk

@Wingtip

Our Discourse instance has recently started to misbehave with regard to backups. Among other things, it’s taking backups twice a day when it is set to take them every other day, and it is not respecting the setting we have in place to keep only a single backup on disk. This is causing the disk to fill up with large backups, leading to subsequent backup failures. Here are our settings:

Here’s some of the recent backup activity:

This results in inadequate space to gzip the backup files. This is from the end of the log for the most recent failed backup:

...
[2019-01-24 04:02:58] Archiving data dump...
[2019-01-24 04:03:13] Archiving uploads...
[2019-01-24 04:06:38] Removing tmp '/var/www/discourse/tmp/backups/default/2019-01-24-035429' directory...
[2019-01-24 04:06:38] Gzipping archive, this may take a while...
[2019-01-24 04:25:53] EXCEPTION: /var/www/discourse/lib/backup_restore/backuper.rb:249:in `create_archive': Failed to gzip archive.

gzip: /var/www/discourse/public/backups/default/quarter-to-three-forums-2019-01-24-035429-v20190110201340.tar.gz: No space left on device

[2019-01-24 04:25:53] /var/www/discourse/lib/discourse.rb:29:in `execute_command'
/var/www/discourse/lib/backup_restore/backuper.rb:249:in `create_archive'
/var/www/discourse/lib/backup_restore/backuper.rb:45:in `run'
/var/www/discourse/lib/backup_restore/backup_restore.rb:16:in `backup!'
/var/www/discourse/app/jobs/regular/create_backup.rb:8:in `execute'
/var/www/discourse/app/jobs/base.rb:137:in `block (2 levels) in perform'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/rails_multisite-2.0.4/lib/rails_multisite/connection_management.rb:63:in `with_connection'
/var/www/discourse/app/jobs/base.rb:127:in `block in perform'
/var/www/discourse/app/jobs/base.rb:123:in `each'
/var/www/discourse/app/jobs/base.rb:123:in `perform'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:187:in `execute_job'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:169:in `block (2 levels) in process'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/middleware/chain.rb:128:in `block in invoke'
/var/www/discourse/lib/sidekiq/pausable.rb:81:in `call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/middleware/chain.rb:130:in `block in invoke'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/middleware/chain.rb:133:in `invoke'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:168:in `block in process'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:139:in `block (6 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/job_retry.rb:98:in `local'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:138:in `block (5 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq.rb:36:in `block in <module:Sidekiq>'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:134:in `block (4 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:199:in `stats'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:129:in `block (3 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/job_logger.rb:8:in `call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:128:in `block (2 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/job_retry.rb:73:in `global'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:127:in `block in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/logging.rb:48:in `with_context'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/logging.rb:42:in `with_job_hash_context'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:126:in `dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:167:in `process'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:85:in `process_one'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:73:in `run'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/util.rb:16:in `watchdog'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/util.rb:25:in `block in safe_thread'
[2019-01-24 04:25:53] Deleting old backups...
[2019-01-24 04:25:53] Cleaning stuff up...
[2019-01-24 04:25:53] Removing '.tar' leftovers...
[2019-01-24 04:25:53] Unpausing sidekiq...
[2019-01-24 04:25:53] Marking backup as finished...
[2019-01-24 04:25:53] Refreshing disk stats...
[2019-01-24 04:25:55] Notifying 'system' of the end of the backup...

At the onset of this process, when we used the launcher to clean up and remove orphaned backups, there were 176 GB of available space on the disk. This is what we see this morning:

root@forum:/var/discourse/shared/standalone/backups/default# ls -lah
total 121G
drwxr-xr-x 2 username www-data 4.0K Jan 23 23:25 .
drwxr-xr-x 3 username www-data 4.0K Jul 17 2018 ..
-rw-r--r-- 1 username www-data 31G Jan 22 23:03 quarter-to-three-forums-2019-01-23-035137-v20190110201340.tar.gz
-rw-r--r-- 1 username www-data 31G Jan 22 23:38 quarter-to-three-forums-2019-01-23-042634-v20190110201340.tar.gz
-rw-r--r-- 1 username www-data 31G Jan 23 00:08 quarter-to-three-forums-2019-01-23-045655-v20190110201340.tar.gz
-rw-r--r-- 1 username www-data 31G Jan 23 00:39 quarter-to-three-forums-2019-01-23-052716-v20190110201340.tar.gz
root@forum:/var/discourse/shared/standalone/backups/default# du -sh
121G .
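(Until the retention issue is sorted out, stale archives like these can be pruned by hand. Below is a minimal sketch of the keep-newest-N rule the settings imply, demonstrated on a throwaway directory; in practice you would point BACKUP_DIR at the real backups path shown above.)

```shell
# Demonstrated on a temp directory; in practice set
# BACKUP_DIR=/var/discourse/shared/standalone/backups/default
BACKUP_DIR=$(mktemp -d)
KEEP=1

# Fake three dated backups (touch -d backdates the mtime).
for day in 22 23 24; do
  touch -d "2019-01-$day" "$BACKUP_DIR/forum-2019-01-$day-v20190110201340.tar.gz"
done

# Newest first, skip the first KEEP entries, delete the rest.
ls -t "$BACKUP_DIR"/*.tar.gz | tail -n +$((KEEP + 1)) | xargs -r rm -f

ls "$BACKUP_DIR"   # only the newest backup (the 2019-01-24 one) remains
```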

Any idea what’s going on?

It looks like the last backup that was successfully uploaded to S3 is from January 18, so it makes sense that the system tries to create a new backup. For some reason, it has failed to upload any new backups since then.

Do you have any idea why the backups have different sizes? I see a 30GB and 16GB backup for each day. Are you by any chance using the same bucket for two Discourse instances? Maybe a development environment or a staging server? If so, that doesn’t work! You’d at least need to put the backups in different folders if you want to reuse the bucket.

Did you receive any messages from the system user about failed automatic backups? Can you make sure that you have enough free space on the local storage for a backup and manually start the backup process? Watch the backup logs for errors that might give you a clue why the upload to S3 fails.

5 Likes

Yes, the partial log that shows that gzip failed is from that message. We’ve been manually cleaning out and re-backing up. I have no idea why they are different sizes. I’ll trigger a manual backup now.

The log shows a “No space left on device” error. That doesn’t really explain why previous uploads to S3 failed, unless you really were low on disk space since January 18.

You certainly have two different Discourse instances backing up to that bucket. Take a look at the filenames. The 30GB backups are created by a recent version of Discourse (20190110) and the 16GB backups by an older version (20180710).
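(One quick way to check this from a shell: the stamp after “-v” in each filename identifies the codebase that produced the backup, so grouping by it shows how many distinct instances are writing to the same location. A sketch using filenames quoted in this thread; run the same pipeline against your own bucket listing or local backup directory instead of the hard-coded names.)

```shell
# Group backup filenames by the version stamp after "-v"; two distinct
# stamps means two different Discourse codebases are producing backups.
# (Filenames taken from this thread for illustration.)
printf '%s\n' \
  'quarter-to-three-forums-2019-01-23-035137-v20190110201340.tar.gz' \
  'quarter-to-three-forums-2019-01-24-041654-v20180710075119.tar.gz' \
  'quarter-to-three-forums-2019-01-24-035429-v20190110201340.tar.gz' |
  grep -o 'v[0-9]\{14\}' | sort | uniq -c
```

Here that prints one backup from the v2018 codebase and two from the v2019 codebase, matching the two-instance diagnosis.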

1 Like

The uploads to S3 aren’t failing… Here’s a screenshot of the S3 bucket (latest entries):

You should see the same list of backups in /admin/backups. If the latest backup isn’t visible at the top of the list then we found the reason why Discourse tries to create new backups over and over again.

1 Like

The same list of backups appears in both the S3 bucket and /admin/backups. I just ran a manual backup and it worked flawlessly. It seems to be something about the automated system and/or the Sidekiq job that’s causing the issue. I just checked Sidekiq and it seemed to be humming along happily. Is there a way to see a list of which tasks failed in Sidekiq?

To clarify: I just dropped the number of retained backups on disk from 2 to 1 this morning. That should not impact anything else; I just wanted to make that clear.

You should find errors in /logs. But my gut tells me that the root cause may well be the second Discourse instance that uploaded the 30GB backups up until January 18. You really should look into this.

1 Like

I just ran docker ps and only one container is running. Can two Discourse instances run inside the same container somehow?

Note that we have a backup from that 2018 version from today:

quarter-to-three-forums-2019-01-24-041654-v20180710075119.tar.gz

The 2018 version backups are 16GB and the 2019 ones are 30GB. We’re running v2.2.0.beta9 +18.

I don’t think so, but who knows. I’ve heard of the strangest Docker bugs… Or maybe there’s a rogue Sidekiq process running with old code? :man_shrugging: You could restart or rebuild the container if you want to make sure everything inside it works correctly.

That means that the backups with v20190110201340 in the filename are the correct ones and you’ll have to figure out where the other ones are coming from.

2 Likes

That is weird. I’ll just create a new bucket on S3 and remove access to the old one to see what happens.

1 Like

I do not see multiple sidekiq processes running.

root@forum:~# ps auxww|grep -i side|grep -v grep
forum 19222 1.1 2.3 2031300 384464 ? SNl Jan23 22:48 sidekiq 5.1.3 discourse [0 of 5 busy]

I also only see one nginx master, one postgres main, etc. Note that the files Clay posted in the first message (the four backups at 30GB apiece) were on local disk. There were no 16GB backups there.

Two separate questions here:

a) Why weren’t those four backups on local disk deleted? Why wasn’t the previous setting of retaining two on disk (now changed to one) honored? I had to delete them myself.

b) Why are we taking two backups from different Discourse versions when it looks like only one instance is running? Maybe moving S3 buckets will answer that question; if not, we’ll do a rebuild and restart as well.

The maximum backups setting applies only to the backup location (S3 in your case), and since you disabled the S3 bucket cleanup, it’s ignored completely.

When the backup location is set to S3, you shouldn’t see a backup file on the local disk after the backup job finishes, whether the job fails or succeeds. The backup is stored on the local disk only as a temp file while the backup runs.
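(The intended behaviour, with the temp file removed whether the job succeeds or fails, is the classic cleanup-on-any-exit pattern. A rough shell sketch of the idea; this mirrors the Ruby `ensure` block the backuper should be using and is not Discourse’s actual code.)

```shell
# Cleanup-on-any-exit: the archive lives in a temp dir that a trap
# removes whether the work succeeds or fails.
workdir=$(mktemp -d)
(
  trap 'rm -rf "$workdir"' EXIT   # fires on success AND on failure

  echo "pretend database dump" > "$workdir/dump.sql"
  tar -czf "$workdir/backup.tar.gz" -C "$workdir" dump.sql
  # ... upload "$workdir/backup.tar.gz" to S3 here ...
)
ls -d "$workdir" 2>/dev/null || echo "temp dir cleaned up"
```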

1 Like

Ahh, did not realize that. So why did we have four backups on local disk, if they’re supposed to be deleted whether the backup job succeeds or not?

I think it might silently ignore any error that happens during that cleanup. I’ll check that later. That might have been my mistake.

2 Likes

Thanks. We think we figured out the older, smaller backups. Someone else may have an old copy of the forum running somewhere with the S3 creds.

We do still have the failed backups, the 30GB ones, sitting on local disk. That looks like a separate issue.

1 Like

It actually didn’t delete the temporary backup file in all cases and it didn’t log an error when the deletion failed. I made some changes so that it always tries to delete the backup file and if it fails you will see it in /logs.

https://github.com/discourse/discourse/commit/c94a2bc69b9fcbc39435066a35f0263390a8257c

8 Likes

Nice! Is that in now? Should we update?

Yes, it’s in tests-passed. I guess it wouldn’t hurt to update. It might help in finding the cause for the problem.

3 Likes