Discourse failed to backup, how to debug?


(Lê Trần Đạt) #1

Hi,

My Discourse has been failing to back up for the last 2 days.

Here are the backup logs:

[2016-10-29 22:58:43] Making sure ‘/var/www/discourse/tmp/backups/default/2016-10-29-225843’ exists…
[2016-10-29 22:58:43] Backup process was cancelled!
[2016-10-29 22:58:43] Notifying ‘ltd’ of the end of the backup…

The directory /var/www/discourse/tmp/backups/default/2016-10-29-225843 exists

In the parent directory, I found 5 empty directories. I guess 4 were created by my manual attempts and 1 was created automatically.

root@daynhauhoc-app:/var/www/discourse/tmp/backups/default# ls -lrth
total 20K
drwxr-xr-x 2 discourse www-data 4.0K Oct 29 07:17 2016-10-29-071717
drwxr-xr-x 2 discourse www-data 4.0K Oct 29 08:00 2016-10-29-080055
drwxr-xr-x 2 discourse www-data 4.0K Oct 29 08:14 2016-10-29-081441
drwxr-xr-x 2 discourse www-data 4.0K Oct 29 21:08 2016-10-29-210807
drwxr-xr-x 2 discourse www-data 4.0K Oct 29 22:58 2016-10-29-225843
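
For anyone who wants to check for the same symptom, here is a minimal Ruby sketch that lists empty directories left behind under a backup root. The helper name `empty_backup_dirs` is hypothetical (it is not part of Discourse), and it assumes Ruby 2.5 or newer for `Dir.children` and `Dir.empty?`.

```ruby
# Hypothetical helper, not part of Discourse.
# Returns the names of empty subdirectories directly under `root`, sorted.
def empty_backup_dirs(root)
  Dir.children(root)
     .map { |name| File.join(root, name) }
     .select { |path| File.directory?(path) && Dir.empty?(path) }
     .map { |path| File.basename(path) }
     .sort
end

# Example call, using the path from the listing above:
#   puts empty_backup_dirs("/var/www/discourse/tmp/backups/default")
```

Each name it returns corresponds to a backup attempt that created its working directory but never wrote anything into it.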

I’m using 1.7.0.beta6, latest commit on 28 Oct 16.

I did a successful manual backup right before this issue started.
Before that, the automatic backups worked fine, but not now.

I have no idea how to debug this issue.

Thanks


#2

I’m seeing some weird behavior too. The first time I launch a backup nothing happens, and when I cancel, the log stops at “Notifying *** of the end of the backup”.

But on another up-to-date Discourse (I’m on this commit), I didn’t have any issue during the backup (except that I never received the notification).

I don’t think plugins can impact backups, but here are mine:

          - git clone https://github.com/discourse/docker_manager.git
          - git clone https://github.com/iunctis/iunctis-toolbar.git
          - git clone https://github.com/discourse/discourse-spoiler-alert.git
          - git clone https://github.com/iunctis/vb_emoji.git
          - git clone https://github.com/iunctis/discourse-affiliate.git

But I had no issue during my backup last night and I didn’t do any upgrade since then. Weird.


(Rafael dos Santos Silva) #3

Can confirm manual backups are broken, and I think this only affects non-English boards. What’s the language of your boards?


(Ted Hess) #4

@trandatnh - I am having the same problem since updating to 1.7.0-beta6 (first upgrade since pre-beta). There seems to be no clear reasons in any logs indicating what may be failing as you have observed. I do notice that after the auto-backup starts, it appears to not complete. Cancelling the backup doesn’t really seem to be doing anything as attempting a manual backup just says “An operation is currently running. Can’t start a new job right now.”

@Falco - Our forum is English.


#5

French for me

I’ll try in English tonight and let you know.


(Lê Trần Đạt) #6

@Falco It’s an English board, if you mean the default language, although we discuss in Vietnamese.

This morning it got worse: the automatic backup started and failed to complete, which forced the site into read-only mode. Cancelling the backup from the admin panel doesn’t work.

I’m running ./launcher rebuild app, hoping it will stop the backup for now.

It worked, and I have disabled scheduled backups :’( It’s bad not to have a daily backup.

Exactly, I have the same problem this morning as I mentioned above.


(Lapinot) #7

Confirming, I can reproduce this failure (actually I’m struggling with backups too, see here). Generally speaking, I find it quite questionable that the tests-passed branch is the default one in the yaml container file…


(Leo Makkinje) #8

Same issue here. Since updating to “latest-release +95 1.7.0.beta6” there seems to be a stuck or broken backup process.

I first noticed it a day after the update when I received a notice about high CPU load on our forum VPS. Tried a reboot, didn’t fix. The top command lists a Ruby process that’s constantly consuming around 100% CPU.

Then I tried “cd /var/discourse; git pull; ./launcher rebuild app; ./launcher cleanup”. It initially failed because the database couldn’t be shut down. A second attempt did work. The site was accessible again and no data was lost.

Then I tried to perform a manual backup. Normally I’d see a list of log messages but now all I see is a spinning wheel. Clicking Cancel gets the site out of ReadOnly mode for a few seconds and then it goes back to ReadOnly.

I’m going to do another rebuild and switch off automatic backups, and see if that a) keeps the site in permanent read-write mode, b) keeps CPU load at a normal level. And then pray that a) the forum doesn’t crash, b) the bug gets fixed soon.

BTW, is there any way to roll back to a previous version, before 1.7.0.beta6?


(Régis Hanol) #9

We’re aware of the issues regarding backups and are actively working on fixing them. We have identified the problem but not the cause yet. It’s high on our list and will be fixed by the end of the week (hopefully sooner).


(EW 👌) #10

Mine is English, but I’m facing the same problem.


(Robin Ward) #11

Okay I’ve spent over a day on this now. No solution but have narrowed it down and I think @sam needs to look at it. I believe the bug is in mini_racer, which seems to be crashing randomly during transpilation.

To reproduce the bug in development mode:

  1. rm -rf tmp
  2. redis-cli flushall
  3. Create a backup in /admin/backups

It should crash on “Notifying ‘eviltrout’ of the end of the backup…”

The process will be using 100% CPU, and you need to kill it before testing again.

Notes:

  • The backup takes place in a process forked from unicorn, which is itself forked. I think this is important, as running it from a rails console does not create the same issue. If you recall, we were able to crash discourse altogether when we were fiddling with PrettyText warming up before forking. I think mini_racer is a little delicate when being forked in our app.

  • The file it crashes while transpiling changes. Sometimes it’s the first file, sometimes it’s the fourth file, etc.

  • Because some files will succeed, if you don’t rm -rf tmp the site will eventually start working as it will have cached all the files it needs to transpile. This is why it took forever to debug, because it would eventually fix itself!

  • mini_racer is supposed to have a 15s timeout on eval, but even if you wait 15s it never continues.
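
Since the in-process timeout never fires here, one way to avoid killing the stuck process by hand while testing is a watchdog in the parent. This is only a sketch under assumptions: `run_with_watchdog` is a hypothetical helper, it is not how Discourse runs backups, and it relies on `fork`, so it needs a Unix-like system.

```ruby
# Hypothetical helper, not part of Discourse or mini_racer.
# Runs the block in a forked child and kills the child if it is still
# running after `timeout_seconds`. Returns :ok, :failed, or :killed.
def run_with_watchdog(timeout_seconds)
  pid = fork do
    yield
    exit!(0) # skip at_exit handlers inherited from the parent
  end

  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout_seconds
  loop do
    # WNOHANG makes wait2 return nil right away while the child is still running.
    finished_pid, status = Process.wait2(pid, Process::WNOHANG)
    return(status.success? ? :ok : :failed) if finished_pid

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      Process.kill("KILL", pid) # the child cannot trap or ignore SIGKILL
      Process.wait(pid)         # reap it so no zombie is left behind
      return :killed
    end
    sleep 0.05
  end
end
```

The deadline is enforced from outside the child, so it still applies when the child is spinning at 100% CPU and its own timeout never triggers.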


(Sam Saffron) #12

Should be fixed per:

After we forked, we correctly reset the v8 context on pretty text, but the transpiler and js locale helper still held the v8 context from the parent.

Since v8 is not fork safe (and probably never will be) we must clear all our v8 context after forking.
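
To illustrate the idea (this is not the actual commit), here is a minimal sketch of a holder that throws away a context inherited from the parent process. `FakeContext` is a stand-in for a real v8 context, and `ForkAwareContext` and `child_rebuilds_context?` are hypothetical names. The sketch assumes a Unix-like system where `fork` is available.

```ruby
# Stand-in for an expensive, non-fork-safe resource such as a v8 context.
# It remembers which process created it.
class FakeContext
  attr_reader :owner_pid

  def initialize
    @owner_pid = Process.pid
  end
end

# Hypothetical holder: hands out one context per process. If the cached
# context was built by a different PID, we are in a forked child, so the
# inherited context is dropped and a fresh one is built.
module ForkAwareContext
  @mutex = Mutex.new
  @context = nil

  def self.context
    @mutex.synchronize do
      @context = FakeContext.new if @context.nil? || @context.owner_pid != Process.pid
      @context
    end
  end
end

# Forks a child and reports whether the child ended up with its own context.
def child_rebuilds_context?
  ForkAwareContext.context # make sure the parent has built its context first
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    rebuilt = ForkAwareContext.context.owner_pid == Process.pid
    writer.write(rebuilt ? "1" : "0")
    writer.close
    exit!(0)
  end
  writer.close
  answer = reader.read
  reader.close
  Process.wait(pid)
  answer == "1"
end
```

The point of the PID check is that every holder of a context has to do it. In the bug described above, one holder was reset after the fork while two others were not.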

Long term we should probably extend MiniRacer to allow it to “manually” free up all v8 contexts prior to forking and call a custom fork command, because freeing them prior to the fork is ideally the best way to do it. Also, ruby really should give us a hook that we can call prior to forking.

Sadly, this has been on the back burner for so so long:

https://bugs.ruby-lang.org/issues/5446


(Sam Saffron) #13

Sadly, not fully sorted out … which is very odd. Will continue to debug this.


(Robin Ward) #14

Update: Sam pushed another fix that does seem to fix this problem. I confirmed it is working this morning.


(EW 👌) #15

Yes, I just tried and it’s working perfectly :slight_smile: Thank you!


(Robin Ward) #16