Our Discourse instance has recently started to misbehave with regard to backups. Among other things, it’s taking backups twice a day when it is set to take them every other day, and it is not respecting the setting we have in place to maintain only a single backup on disk. This is causing the disk to fill up with large backups, leading to subsequent backup failures. Here are our settings:
This results in inadequate space to gzip the backup files. This is from the end of the log for the most recent failed backup:
...
[2019-01-24 04:02:58] Archiving data dump...
[2019-01-24 04:03:13] Archiving uploads...
[2019-01-24 04:06:38] Removing tmp '/var/www/discourse/tmp/backups/default/2019-01-24-035429' directory...
[2019-01-24 04:06:38] Gzipping archive, this may take a while...
[2019-01-24 04:25:53] EXCEPTION: /var/www/discourse/lib/backup_restore/backuper.rb:249:in `create_archive': Failed to gzip archive.
gzip: /var/www/discourse/public/backups/default/quarter-to-three-forums-2019-01-24-035429-v20190110201340.tar.gz: No space left on device
[2019-01-24 04:25:53] /var/www/discourse/lib/discourse.rb:29:in `execute_command'
/var/www/discourse/lib/backup_restore/backuper.rb:249:in `create_archive'
/var/www/discourse/lib/backup_restore/backuper.rb:45:in `run'
/var/www/discourse/lib/backup_restore/backup_restore.rb:16:in `backup!'
/var/www/discourse/app/jobs/regular/create_backup.rb:8:in `execute'
/var/www/discourse/app/jobs/base.rb:137:in `block (2 levels) in perform'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/rails_multisite-2.0.4/lib/rails_multisite/connection_management.rb:63:in `with_connection'
/var/www/discourse/app/jobs/base.rb:127:in `block in perform'
/var/www/discourse/app/jobs/base.rb:123:in `each'
/var/www/discourse/app/jobs/base.rb:123:in `perform'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:187:in `execute_job'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:169:in `block (2 levels) in process'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/middleware/chain.rb:128:in `block in invoke'
/var/www/discourse/lib/sidekiq/pausable.rb:81:in `call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/middleware/chain.rb:130:in `block in invoke'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/middleware/chain.rb:133:in `invoke'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:168:in `block in process'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:139:in `block (6 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/job_retry.rb:98:in `local'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:138:in `block (5 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq.rb:36:in `block in <module:Sidekiq>'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:134:in `block (4 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:199:in `stats'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:129:in `block (3 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/job_logger.rb:8:in `call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:128:in `block (2 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/job_retry.rb:73:in `global'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:127:in `block in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/logging.rb:48:in `with_context'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/logging.rb:42:in `with_job_hash_context'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:126:in `dispatch'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:167:in `process'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:85:in `process_one'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:73:in `run'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/util.rb:16:in `watchdog'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/sidekiq-5.1.3/lib/sidekiq/util.rb:25:in `block in safe_thread'
[2019-01-24 04:25:53] Deleting old backups...
[2019-01-24 04:25:53] Cleaning stuff up...
[2019-01-24 04:25:53] Removing '.tar' leftovers...
[2019-01-24 04:25:53] Unpausing sidekiq...
[2019-01-24 04:25:53] Marking backup as finished...
[2019-01-24 04:25:53] Refreshing disk stats...
[2019-01-24 04:25:55] Notifying 'system' of the end of the backup...
At the onset of this process, after we used the launcher to clean up and removed the orphaned backups, there was 176 GB of available space on the disk. This is what we see this morning:
root@forum:/var/discourse/shared/standalone/backups/default# ls -lah
total 121G
drwxr-xr-x 2 username www-data 4.0K Jan 23 23:25 .
drwxr-xr-x 3 username www-data 4.0K Jul 17 2018 ..
-rw-r--r-- 1 username www-data 31G Jan 22 23:03 quarter-to-three-forums-2019-01-23-035137-v20190110201340.tar.gz
-rw-r--r-- 1 username www-data 31G Jan 22 23:38 quarter-to-three-forums-2019-01-23-042634-v20190110201340.tar.gz
-rw-r--r-- 1 username www-data 31G Jan 23 00:08 quarter-to-three-forums-2019-01-23-045655-v20190110201340.tar.gz
-rw-r--r-- 1 username www-data 31G Jan 23 00:39 quarter-to-three-forums-2019-01-23-052716-v20190110201340.tar.gz
root@forum:/var/discourse/shared/standalone/backups/default# du -sh
121G .
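For reference, this is roughly what we’ve been doing by hand to check and reclaim space (paths are from the output above; adjust if your shared volume lives elsewhere):

# free space on the host volume that holds the backups
df -h /var/discourse/shared/standalone

# what's eating the space
du -sh /var/discourse/shared/standalone/backups/default/*

# delete everything except the newest local archive
ls -t /var/discourse/shared/standalone/backups/default/*.tar.gz | tail -n +2 | xargs -r rm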
It looks like the last backup that was successfully uploaded to S3 is from January 18, so it makes sense that the system keeps trying to create a new backup. For some reason it has failed to upload any new backups since then.
Do you have any idea why the backups have different sizes? I see a 30GB and 16GB backup for each day. Are you by any chance using the same bucket for two Discourse instances? Maybe a development environment or a staging server? If so, that doesn’t work! You’d at least need to put the backups in different folders if you want to reuse the bucket.
Did you receive any messages from the system user about failed automatic backups? Can you make sure that you have enough free space on the local storage for a backup and manually start the backup process? Watch the backup logs for errors that might give you a clue why the upload to S3 fails.
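If it helps, something along these lines should show whether there’s room for the temporary archive, both on the host and inside the container (the container name app is the default from a standard install and is just an assumption here):

# on the host: free space on the volume that holds the shared data
df -h /var/discourse/shared/standalone

# on the host: the Rails log usually ends up here in a standard install
tail -f /var/discourse/shared/standalone/log/rails/production.log

# inside the container (the archive is gzipped under this path, per the log above)
cd /var/discourse
./launcher enter app
df -h /var/www/discourse/public/backups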
Yes, the partial log that shows that gzip failed is from that message. We’ve been manually cleaning out and re-backing up. I have no idea why they are different sizes. I’ll trigger a manual backup now.
The log shows a “No space left on device” error. That doesn’t really explain why the previous uploads to S3 failed, unless you really have been low on disk space since January 18.
You certainly have two different Discourse instances backing up to that bucket. Take a look at the filenames. The 30GB backups are created by a recent version of Discourse (20190110) and the 16GB backups by an older version (20180710).
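If you want to see at a glance which version stamps are present in the bucket, something like this should do it (the bucket name is a placeholder; assumes the AWS CLI is configured):

aws s3 ls s3://YOUR-BACKUP-BUCKET/ --recursive | grep -o 'v[0-9]\{14\}' | sort | uniq -c

Anything beyond the version stamps your own instance has gone through points at another installation writing to the same bucket.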
You should see the same list of backups in /admin/backups. If the latest backup isn’t visible at the top of the list, then we’ve found the reason why Discourse tries to create new backups over and over again.
The same list of backups appears in both the S3 bucket and /admin/backups. I just ran a manual backup and it worked flawlessly. It seems to be something about the automated system and/or the Sidekiq job that’s causing the issue. I just checked Sidekiq and it seemed to be humming along happily. Is there a way to see a list of which tasks have failed in Sidekiq?
To clarify, I just dropped the number of retained backups on disk from 2 to 1 this morning. Should not impact anything else, just wanted to make that clear.
You should find errors in /logs. But my gut tells me that the root cause may well be the second Discourse instance that uploaded the 30GB backups up until January 18. You really should look into this.
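To answer the Sidekiq question: besides /logs, you can inspect Sidekiq’s retry and dead sets from a console inside the container. These are standard Sidekiq APIs; the container name app is an assumption.

cd /var/discourse
./launcher enter app
cd /var/www/discourse
rails runner 'require "sidekiq/api";
  puts "retrying: #{Sidekiq::RetrySet.new.size}, dead: #{Sidekiq::DeadSet.new.size}";
  Sidekiq::RetrySet.new.each { |job| puts "#{job.klass}: #{job["error_message"]}" }'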
I don’t think so, but who knows. I’ve heard of the strangest Docker bugs… Or maybe there’s a rogue Sidekiq process running with old code? You could restart or rebuild the container if you want to make sure everything inside the container works correctly.
That means that the backups with v20190110201340 in the filename are the correct ones and you’ll have to figure out where the other ones are coming from.
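For reference, the restart/rebuild mentioned above is just the usual launcher commands on the host (again assuming the default container name app):

cd /var/discourse
./launcher restart app   # quick bounce of the running container
./launcher rebuild app   # full rebuild; takes the site offline for a few minutes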
I also only see one nginx master, one postgres main, etc. Note that the files Clay posted in the first message (the four backups at 30GB apiece) were on local disk. There were no 16GB backups there.
Two separate questions here:
a) Why weren’t those four backups on the local disk deleted? Why wasn’t the previous setting of retaining two on disk (now changed to one) honored? I had to delete them myself.
b) Why are we taking two backups from different Discourse versions when it looks like only one instance is running? Maybe moving S3 buckets will answer that question; if not, we’ll also do a rebuild and restart.
The maximum backups setting applies only to the backup location (S3 in your case), and since you disabled the S3 bucket cleanup, it’s ignored completely.
You shouldn’t see a backup file on the local disk after the backup job finishes when the backup location is set to S3, regardless of whether the job fails or succeeds. The backup is only stored on the local disk as a temporary file while the backup is running.
It turns out it didn’t delete the temporary backup file in all cases, and it didn’t log an error when the deletion failed. I’ve made some changes so that it always tries to delete the backup file, and if that fails, you’ll see it in /logs.
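In the meantime, a quick manual check for leftovers, using the paths from the log above (either in-container directory may simply not exist if cleanup worked):

# on the host
ls -lah /var/discourse/shared/standalone/backups/default/

# inside the container ('app' is an assumption)
cd /var/discourse
./launcher enter app
ls -lah /var/www/discourse/tmp/backups/default/ /var/www/discourse/public/backups/default/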