Discourse failed to backup, how to debug?

trandatnh · October 29, 2016, 11:05pm

Hi,

My Discourse has been failing to back up for the last 2 days.

Here are the backup logs:

[2016-10-29 22:58:43] Making sure ‘/var/www/discourse/tmp/backups/default/2016-10-29-225843’ exists…
[2016-10-29 22:58:43] Backup process was cancelled!
[2016-10-29 22:58:43] Notifying ‘ltd’ of the end of the backup…

The directory /var/www/discourse/tmp/backups/default/2016-10-29-225843 exists.

In the parent directory, I found 5 empty directories. I guess 4 were created by backups I started manually and 1 by the automatic backup.

root@daynhauhoc-app:/var/www/discourse/tmp/backups/default# ls -lrth
total 20K
drwxr-xr-x 2 discourse www-data 4.0K Oct 29 07:17 2016-10-29-071717
drwxr-xr-x 2 discourse www-data 4.0K Oct 29 08:00 2016-10-29-080055
drwxr-xr-x 2 discourse www-data 4.0K Oct 29 08:14 2016-10-29-081441
drwxr-xr-x 2 discourse www-data 4.0K Oct 29 21:08 2016-10-29-210807
drwxr-xr-x 2 discourse www-data 4.0K Oct 29 22:58 2016-10-29-225843
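
For anyone checking the same thing on their own install, here is a minimal Ruby sketch that lists empty subdirectories under a backup temp directory. The helper name `empty_backup_dirs` is made up for illustration, and the demo runs against a throwaway directory with invented contents (the `2016-10-28-033000` folder and `dump.sql` file are not from my server).

```ruby
require "tmpdir"

# Hypothetical helper: returns the sorted names of empty subdirectories under `root`.
def empty_backup_dirs(root)
  Dir.children(root)
     .select { |name| File.directory?(File.join(root, name)) }
     .select { |name| Dir.empty?(File.join(root, name)) }
     .sort
end

# Demo against a temporary directory, not a real Discourse install.
Dir.mktmpdir do |root|
  # An empty directory, like the ones left behind by the failed runs.
  Dir.mkdir(File.join(root, "2016-10-29-071717"))

  # A non-empty directory (invented for the demo) that should not be reported.
  Dir.mkdir(File.join(root, "2016-10-28-033000"))
  File.write(File.join(root, "2016-10-28-033000", "dump.sql"), "-- data")

  p empty_backup_dirs(root) # prints ["2016-10-29-071717"]
end
```

On a real server you would presumably pass the path shown in the prompt above instead of a temporary directory.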

I’m using 1.7.0.beta6, latest commit on 28 Oct 16.

I did a successful manual backup right before this issue started.
Before that, the automatic backup worked fine, but not now.

I have no idea how to debug this issue.

Thanks

Steven · October 30, 2016, 1:39am

I have some weird behavior too. The first time I launch a backup nothing happens, and when I cancel, the log is on the “notifying *** of the end of the backup” step.

But on another up-to-date Discourse (I’m on this commit), I didn’t have any issue during the backup (except that I never received the notification).

I don’t think plugins can impact backups, but here are mine:

          - git clone https://github.com/discourse/docker_manager.git
          - git clone https://github.com/iunctis/iunctis-toolbar.git
          - git clone https://github.com/discourse/discourse-spoiler-alert.git
          - git clone https://github.com/iunctis/vb_emoji.git
          - git clone https://github.com/iunctis/discourse-affiliate.git

But I had no issue during my backup last night and I didn’t do any upgrade since then. Weird.

Falco · October 30, 2016, 3:15pm

Can confirm manual backups are broken, and I think this only affects non-English boards. What’s the language of your boards?

thess · October 30, 2016, 5:14pm

@trandatnh - I have been having the same problem since updating to 1.7.0-beta6 (first upgrade since pre-beta). As you observed, there seems to be no clear reason in any logs indicating what may be failing. I do notice that after the auto-backup starts, it appears not to complete. Cancelling the backup doesn’t really seem to do anything, as attempting a manual backup just says “An operation is currently running. Can’t start a new job right now.”

@Falco - Our forum is English.

Steven · October 30, 2016, 5:27pm

French for me.

I’ll try in English tonight and let you know.

trandatnh · October 31, 2016, 12:28am

@Falco It’s an English board, if you mean the default language, although we discuss in Vietnamese.

This morning it got worse: the automatic backup ran and failed to complete, which forced the site into read-only mode. Cancelling the backup from the admin panel doesn’t work.

I’m running ./launcher rebuild app and hope it will stop the backup for now.

It worked, and I disabled scheduled backups. It’s bad to have no daily backup.

Exactly, I had the same problem this morning, as I mentioned above.

Lapinot · October 31, 2016, 5:11am

Confirming, I can reproduce this failure (actually I’m struggling with backups too, see here). Generally speaking, I find it quite questionable that the tests-passed branch is the default one in the yaml container file…

leomakkinje · November 1, 2016, 9:54am

Same issue here. Since updating to “latest-release +95 1.7.0.beta6” there seems to be a stuck or broken backup process.

I first noticed it a day after the update when I received a notice about high CPU load on our forum VPS. Tried a reboot, didn’t fix. The top command lists a Ruby process that’s constantly consuming around 100% CPU.

Then I tried “cd /var/discourse; git pull; ./launcher rebuild app; ./launcher cleanup”. It initially failed because the database couldn’t be shut down. A second attempt did work. The site was accessible again and no data was lost.

Then I tried to perform a manual backup. Normally I’d see a list of log messages but now all I see is a spinning wheel. Clicking Cancel gets the site out of ReadOnly mode for a few seconds and then it goes back to ReadOnly.

I’m going to do another rebuild and switch off automatic backups and see if that gets the site a) in permanent readwrite mode, b) keeps CPU load at a normal level. And then pray that a) the forum doesn’t crash, b) the bug gets fixed soon.

BTW, is there any way to roll back to a previous version, before 1.7.0.beta6?

zogstrip · November 1, 2016, 10:22am

We’re aware of the issues regarding backups and are actively working on fixing them. We have identified the problem but not the cause yet. It’s high on our list and will be fixed by the end of the week (hopefully sooner).

ewanly · November 1, 2016, 12:20pm

Mine is English, but I’m facing the same problem.

eviltrout · November 1, 2016, 8:41pm

Okay I’ve spent over a day on this now. No solution but have narrowed it down and I think @sam needs to look at it. I believe the bug is in mini_racer, which seems to be crashing randomly during transpilation.

To reproduce the bug in development mode:

1. rm -rf tmp
2. redis-cli flushall
3. Create a backup in /admin/backups

It should crash on “Notifying ‘eviltrout’ of the end of the backup…”

The process will be using 100% CPU, and you need to kill it before testing again.

Notes:

- The backup takes place in a fork from unicorn, which is itself forked. I think this is important, as running it from a rails console does not create the same issue. If you recall, we were able to crash Discourse altogether when we were fiddling with PrettyText warming up before forking. I think mini_racer is a little delicate when being forked in our app.
- The file it crashes on while transpiling changes. Sometimes it’s the first file, sometimes it’s the fourth file, etc.
- Because some files will succeed, if you don’t rm -rf tmp the site will eventually start working, as it will have cached all the files it needs to transpile. This is why it took forever to debug: it would eventually fix itself!
- mini_racer is supposed to have a 15s timeout on eval, but even if you wait 15s it never continues.

sam · November 2, 2016, 2:42am

Should be fixed per:

https://github.com/discourse/discourse/commit/7e43e73df69a5ca70e7f4546465525c7392612fb

After we forked we correctly reset the v8 context on PrettyText, but the transpiler and the JS locale helper still held the v8 context from the parent.

Since v8 is not fork safe (and probably never will be) we must clear all our v8 context after forking.
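
To illustrate the general idea (this is not the code from the commit above), here is a minimal Ruby sketch. `FakeContext` and `ForkAwareHolder` are made-up names standing in for a v8 context and whatever object holds it. The holder remembers which process built the context and rebuilds it when the PID has changed, so a forked child never reuses the parent’s context.

```ruby
# Hypothetical stand-in for an expensive resource that is not fork safe.
class FakeContext
  attr_reader :owner_pid

  def initialize
    @owner_pid = Process.pid # remember which process created this context
  end
end

# Hypothetical holder that hands out a context and rebuilds it after a fork.
class ForkAwareHolder
  def initialize
    @context = nil
  end

  def context
    # Rebuild when there is no context yet, or when it was created by another process.
    if @context.nil? || @context.owner_pid != Process.pid
      @context = FakeContext.new
    end
    @context
  end
end

holder = ForkAwareHolder.new
parent_ctx_pid = holder.context.owner_pid # context created in the parent

reader, writer = IO.pipe
child = fork do
  reader.close
  # In the child the PID differs, so the holder builds a fresh context.
  writer.puts(holder.context.owner_pid == Process.pid)
  writer.close
end
writer.close
child_ok = reader.read.strip
Process.wait(child)

puts "child rebuilt its context: #{child_ok}"
puts "parent kept its context: #{holder.context.owner_pid == parent_ctx_pid}"
```

Checking the PID on access is only one way to do it; explicitly clearing every context right after forking, as described above, achieves the same goal.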

Long term we should probably extend MiniRacer to allow it to “manually” free up all v8 contexts prior to forking and call a custom fork command, because ideally the cleanup should happen prior to the fork. Also, Ruby really should give us a hook that we can call prior to forking.

Sadly, this has been on the back burner for so so long:

https://bugs.ruby-lang.org/issues/5446

sam · November 2, 2016, 6:22am

Sadly, not fully sorted out … which is very odd. Will continue to debug this.

eviltrout · November 2, 2016, 2:26pm

Update: Sam pushed another fix that does seem to fix this problem. I confirmed it is working this morning.

ewanly · November 2, 2016, 4:05pm

Yes, I just tried and it’s working perfectly. Thank you!
