Discourse failed to backup on stable

gkln · November 8, 2016, 9:46pm

Continuing:

which still happens for us on stable, shortly after updating to version 1.6.6.

We have no automatic backup since November 5th, 2016 and are unable to initiate a manual backup:

[2016-11-08 20:09:23] 'Guiwy' has started the backup!
[2016-11-08 20:09:23] Marking backup as running...
[2016-11-08 20:09:23] Making sure '/var/www/discourse/tmp/backups/default/2016-11-08-200923' exists...
[2016-11-08 20:09:23] Making sure '/var/www/discourse/public/backups/default' exists...
[2016-11-08 20:09:23] Backup process was cancelled!
[2016-11-08 20:09:23] Notifying 'Guiwy' of the end of the backup...

Any ideas?

Additionally, and this may be related, we have seen a constant increase in CPU load since this update, and many page loading issues today. A rebuild and reboot helped, but for how long?

Falco · November 8, 2016, 9:50pm

The bug fixed in 1.6.6 was one where the backup failed at the end, when notifying of success.

This one looks different, and I couldn’t repro here.

codinghorror · November 8, 2016, 9:51pm

What plugins are you running? Can you disable all third party plugins and rebuild?

gkln · November 8, 2016, 9:54pm

@Falco, it seems like the same issue. It also logs:

Notifying 'username' of the end of the backup...

if that is important. I added the full logs to the first post.

We are running:

discourse-spoiler-alert
discourse-solved

so it’s pretty safe in that regard.

Also got in the logs:

Sidekiq heartbeat test failed, restarting

174 times.
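
For anyone who wants to reproduce that count from a log file, here is a minimal sketch. It assumes the error messages have been exported to a plain-text file, one message per line; the function name and the file argument are hypothetical, and nothing here assumes where Discourse actually writes its logs.

```shell
# Count how often the Sidekiq heartbeat message appears in a log file.
# $1 is the path to a plain-text log export (an assumption, see above).
count_heartbeat_failures() {
    # -c prints the number of matching lines, -F treats the pattern as a fixed string
    grep -cF 'Sidekiq heartbeat test failed, restarting' "$1"
}
```

Calling `count_heartbeat_failures some-export.log` prints the number of matching lines, which makes it easier to compare the count before and after a rebuild.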

pjv · November 9, 2016, 2:49am

I just got a high CPU alert from my server and seem to be seeing the same issue after updating to 1.6.6 a couple of days ago. No nightly backups since then, and the same

Sidekiq heartbeat test failed, restarting

in the error log.

I’m rebuilding app now…

After the rebuild, CPU was still high. I went into Sidekiq and, as has happened a couple of times in the past, tens of thousands of user emails had queued up out of nowhere, and the app was using all its resources to generate and send them. No idea why. I deleted all the queued emails and CPU usage seems to have settled down now.

I tried to kick off a manual backup and nothing seems to be happening. I’m just getting a spinning indicator and the phrase “No logs yet...”, and nothing is happening in Sidekiq.

sam · November 9, 2016, 3:31am

When did you last rebuild?

pjv · November 9, 2016, 4:06am

Can’t remember exactly but recently enough that I was able to type sudo ./laun and up-arrow a few times to get to sudo ./launcher rebuild app (iow, it was still in my shell history).

pjv · November 9, 2016, 1:43pm

Little more data. When I go into the backup page, it thinks that a backup is in progress - there is an active cancel button. I push that button and it looks like it is canceling (asks me to confirm, cancel button goes away, backup button appears in its place), but then if I reload the backup page, the cancel button is again active. There is no backup actually happening, but some flag somewhere thinks there is and doesn’t seem clearable.

zogstrip · November 9, 2016, 1:45pm

Can you do a “redis-cli flushall” from inside the container?

pjv · November 9, 2016, 1:47pm

I could do that. I’d want to know what it’s supposed to do first. Does discourse use redis to cache objects and is the theory that there is something bad in that cache?

zogstrip · November 9, 2016, 1:49pm

It will erase everything stored in redis.
You will potentially lose some pending email notifications but that’s about it.

We use redis to store the “read-only” and “backing up” state. Due to the bug you were experiencing, it looks like it’s still thinking you’re doing a backup, when you’re not.
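
A minimal sketch of that flush, meant to be run inside the container (after `cd /var/discourse` and `./launcher enter app`). The function name and the `REDIS_CLI` override are hypothetical additions so the steps can be dry-run first; by default it calls the real `redis-cli`.

```shell
# Sketch only: erase everything stored in redis, including the stale
# "backing up" state. Pending email notifications may be lost as well.
# REDIS_CLI is a hypothetical override for dry runs; it defaults to redis-cli.
flush_redis_state() {
    # Report how many keys are about to be erased.
    echo "keys before flush: $("${REDIS_CLI:-redis-cli}" dbsize)"
    # Erase all keys in all redis databases.
    "${REDIS_CLI:-redis-cli}" flushall
}
```

Setting `REDIS_CLI=echo` before calling `flush_redis_state` only prints the subcommands that would be sent, which is a cheap way to check the steps before running them for real.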

pjv · November 9, 2016, 1:57pm

OK, I flushed the redis cache and it no longer thinks it is backing up.

Error log: Sidekiq heartbeat test failed, restarting

Now what?

zogstrip · November 9, 2016, 1:59pm

Is that a new or an old error?

pjv · November 9, 2016, 2:01pm

It’s the only thing I see in the error log post the redis flush. Just one entry.

gkln · November 9, 2016, 2:26pm

Exactly the same behavior for us.

Yesterday we rebuilt, and it no longer seemed that a backup process was running. However, this morning the Cancel button was active again, with no new nightly backup.

The same error also appears regularly in the logs.

We are starting to live on the edge with no backups for this many days and no clear workaround for now…

gkln · November 9, 2016, 7:44pm

CPU load is on the rise again.

On the Sidekiq interface, we see:

SIDEKIQ IS PAUSED!

But it is clearly not, and I would even say that it is doing way too much work: several hundred jobs per minute. There is also a backlog of 500+ planned jobs, which seems to be slowly growing.

Any help is appreciated.

zogstrip · November 9, 2016, 9:15pm

If sidekiq is paused, then it won’t process any new scheduled jobs.

You can unpause it with

cd /var/discourse
./launcher enter app
rails c
Sidekiq.unpause!

pjv · November 9, 2016, 9:29pm

After the rebuild and the redis purge, I kicked off a manual backup, which successfully created a new backup but then got stuck on “notifying sysadmin of finishing the backup” (or whatever it says exactly).

After which, again, cancel button -> looks like it works -> refresh page cancel button active again…

Went into Sidekiq, looked at scheduler tab which said “Sidekiq Paused”. I unpaused it.

Still thinks it is doing a backup. I’ll now purge the redis cache again.

Something is not working right since 1.6.6.

zogstrip · November 9, 2016, 9:41pm

Is there any chance you could update to latest?

gkln · November 9, 2016, 10:04pm

Do you mean stable is not a supported branch?

https://github.com/discourse/discourse/commit/98d87a3ed21559318e34e21fc8eb0dd0682b201c

https://github.com/discourse/discourse/commit/90ef5770372be8b564b558f181bd834517d34617

Was a mini_racer update required on stable? Can it be reverted?
