Discourse failed to backup on stable


#1

Continuing:

which still happens for us on stable, shortly after updating to version 1.6.6.

We have had no automatic backups since November 5th, 2016 and are unable to start a manual backup:

[2016-11-08 20:09:23] 'Guiwy' has started the backup!
[2016-11-08 20:09:23] Marking backup as running...
[2016-11-08 20:09:23] Making sure '/var/www/discourse/tmp/backups/default/2016-11-08-200923' exists...
[2016-11-08 20:09:23] Making sure '/var/www/discourse/public/backups/default' exists...
[2016-11-08 20:09:23] Backup process was cancelled!
[2016-11-08 20:09:23] Notifying 'Guiwy' of the end of the backup...

Any ideas?


Additionally, and it may be related, we have seen a steady increase in CPU load since this update:

and many page loading issues today. A rebuild and a reboot helped, but for how long?


(Rafael dos Santos Silva) #2

The bug fixed in 1.6.6 was the backup failing at the very end, while notifying of success.

This one looks different, and I couldn’t repro here.


(Jeff Atwood) #3

What plugins are you running? Can you disable all third party plugins and rebuild?


#4

@Falco, it seems like the same issue. It also logs:

Notifying 'username' of the end of the backup...

if that is important. I added the complete logs to the first post.


We are running:

  • discourse-spoiler-alert
  • discourse-solved

so it’s pretty safe in that regard.


Also got in the logs:

Sidekiq heartbeat test failed, restarting

174 times.
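
For anyone who wants to track how often this message recurs, here is a minimal sketch that counts it in a log file. It assumes the message is also written to a plain-text log you can read from the shell; the path is passed in as an argument because where it ends up depends on the setup, and `count_heartbeat_failures` is a hypothetical helper name, not part of Discourse.

```shell
#!/bin/sh
# Sketch only: count log lines containing the heartbeat failure message.
# $1 is the path to a log file. Which file holds the message is an
# assumption about your setup, so no path is hard-coded here.
count_heartbeat_failures() {
  # -F matches the text literally, -c prints the number of matching lines.
  # grep exits non-zero when there are no matches, hence the "|| true".
  grep -c -F -- 'Sidekiq heartbeat test failed' "$1" || true
}
```

Calling `count_heartbeat_failures /path/to/some.log` prints a single number, which makes it easy to compare before and after a rebuild.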


(pjv) #5

I just got a high CPU alert from my server and seem to be seeing the same issue after updating to 1.6.6 a couple of days ago. No nightly backups since then, and the same “Sidekiq heartbeat test failed, restarting” in the error log.

I’m rebuilding app now…

After the rebuild, still high CPU. Went into Sidekiq and, as has happened a couple of times in the past, tens of thousands of user emails had queued up out of nowhere and the app was using all its resources to generate and send them. No idea why. I deleted all the queued emails and CPU usage seems to have settled down now.

I tried to kick off a manual backup and nothing seems to be happening. I’m just getting a spinning indicator and the phrase “No logs yet...”, and nothing is happening in Sidekiq.


(Sam Saffron) #6

When did you last rebuild?


(pjv) #7

Can’t remember exactly but recently enough that I was able to type sudo ./laun and up-arrow a few times to get to sudo ./launcher rebuild app (iow, it was still in my shell history).


(pjv) #8

Little more data. When I go into the backup page, it thinks that a backup is in progress - there is an active cancel button. I push that button and it looks like it is canceling (asks me to confirm, cancel button goes away, backup button appears in its place), but then if I reload the backup page, the cancel button is again active. There is no backup actually happening, but some flag somewhere thinks there is and doesn’t seem clearable.


(Régis Hanol) #9

Can you do a “redis-cli flushall” from inside the container?


(pjv) #10

I could do that. I’d want to know what it’s supposed to do first. Does Discourse use Redis to cache objects, and is the theory that there is something bad in that cache?


(Régis Hanol) #11

It will erase everything stored in redis.
You will potentially lose some pending email notifications but that’s about it.

We use redis to store the “read-only” and “backing up” state. Due to the bug you were experiencing, it looks like it’s still thinking you’re doing a backup, when you’re not.
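
Before wiping everything, it can be reassuring to look at what is actually stored. The sketch below lists and counts keys matching a pattern using `redis-cli --scan`, to be run from inside the container. The `*backup*` pattern in the usage note is only a guess at how the flag might be named, not a confirmed Discourse key name, and `list_keys` / `count_keys` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: inspect Redis keys before deciding to flush.
# Assumes redis-cli is on the PATH and talks to the default local instance.

# List keys matching a glob pattern. --scan iterates with SCAN, so it
# does not block the server the way KEYS can on a large dataset.
list_keys() {
  redis-cli --scan --pattern "$1"
}

# Print how many keys match the pattern.
count_keys() {
  list_keys "$1" | wc -l | tr -d ' '
}
```

For example, `list_keys '*backup*'` shows candidate keys (the pattern is an assumption). If nothing looks off, or the stuck state remains, the `redis-cli flushall` suggested above is still the blunt fallback.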


(pjv) #12

OK, I flushed the redis cache and it no longer thinks it is backing up.

Error log: Sidekiq heartbeat test failed, restarting

Now what?


(Régis Hanol) #13

Is that a new or an old error?


(pjv) #14

It’s the only thing I see in the error log post the redis flush. Just one entry.


#15

Exactly the same behavior for us.

Yesterday we rebuilt and it no longer looked like a backup process was running. However, this morning the Cancel button was active again, with no new nightly backup.

The same error also appears regularly in the logs.

We are starting to live on the edge with no backups for this many days and no clear workaround for now…


#16

CPU load is on the rise again. :unamused:

On the Sidekiq interface, we see:

SIDEKIQ IS PAUSED!

But it clearly is not, and I would even say that it is doing way too much work: several hundred jobs per minute. There is also a backlog of 500+ planned jobs, which seems to be slowly growing.

Any help is appreciated. :persevere:


(Régis Hanol) #17

If sidekiq is paused, then it won’t process any new scheduled jobs.

You can unpause it with

cd /var/discourse
./launcher enter app
rails c
Sidekiq.unpause!

(pjv) #18

After the rebuild and the redis purge, I kicked off a manual backup, which successfully created a new backup but then got stuck on “notifying sysadmin of finishing the backup” (or whatever it says exactly).

After which, again, cancel button -> looks like it works -> refresh page cancel button active again…

Went into Sidekiq, looked at scheduler tab which said “Sidekiq Paused”. I unpaused it.

Still thinks it is doing a backup. I’ll now purge the redis cache again.

Something is not working right since 1.6.6.


(Régis Hanol) #19

Is there any chance you could update to latest?


#20

Do you mean stable is not a supported branch?


Was a mini_racer update required on stable? Can it be reverted?