Today we found a number of (independent) servers running into problems at the same time.
Symptoms: read only mode was enabled and the disk was full.
This turned out to be caused by the backup scheduler creating a LOT of backups. The disk filled up so quickly that our monitoring didn’t even catch it in time.
Excerpt from sidekiq.log below.
This didn’t happen on all servers, but on about 8% of them. The issues started independently just after 2:00 AM EST, i.e. just when daylight savings time ended. I can’t explain why this is causing it, but I do think it’s related.
Happened both on 1.4.2 and on 1.5.0beta3 servers.
[STARTED]
'system' has started the backup!
Marking backup as running...
Making sure '/var/www/discourse/tmp/backups/db2155/2015-11-01-020020' exists...
Making sure '/var/www/discourse/public/backups/db2155' exists...
Writing metadata to '/var/www/discourse/tmp/backups/db2155/2015-11-01-020020/meta.json'...
...
(lots of backup logs)
...
Unpausing sidekiq...
Marking backup as finished...
Finished!
[SUCCESS]
[STARTED]
'system' has started the backup!
Marking backup as running...
Making sure '/var/www/discourse/tmp/backups/db2155/2015-11-01-020143' exists...
Making sure '/var/www/discourse/public/backups/db2155' exists...
Writing metadata to '/var/www/discourse/tmp/backups/db2155/2015-11-01-020143/meta.json'...
[STARTED]
'system' has started the backup!
Marking backup as running...
Making sure '/var/www/discourse/tmp/backups/db1714/2015-11-01-020143' exists...Enabling readonly mode...
Making sure '/var/www/discourse/public/backups/db1714' exists...
Writing metadata to '/var/www/discourse/tmp/backups/db1714/2015-11-01-020143/meta.json'...
Pausing sidekiq...
Waiting for sidekiq to finish running jobs...
Dumping the public schema of the database...
[STARTED]
'system' has started the backup!
Marking backup as running...
Making sure '/var/www/discourse/tmp/backups/db1714/2015-11-01-020143' exists...
Making sure '/var/www/discourse/public/backups/db1714' exists...
Writing metadata to '/var/www/discourse/tmp/backups/db1714/2015-11-01-020143/meta.json'..
We have learned to never schedule cronjobs between 2:00 AM and 3:00 AM because of this.
But then again, I can’t figure out how this causes Sidekiq to behave like this.
That would be a good workaround, but doesn’t solve the root issue. Apparently something goes nuts - 2:00AM happening twice doesn’t explain why multiple backups started to run simultaneously. So there must be an additional bug in Sidekiq that got triggered somehow.
I’ve seen similar problems result from collecting the time from one source, but the time zone from another. This looks like the backup could be running at 2 AM EST, but getting saved as 2 AM EDT, then running again because it doesn’t think the 2 AM EST scheduled backup has run.