Discourse fails to back up: how to debug?

Hi,

My Discourse backups have been failing for the last 2 days.

Here are the backup logs:

[2016-10-29 22:58:43] Making sure ‘/var/www/discourse/tmp/backups/default/2016-10-29-225843’ exists…
[2016-10-29 22:58:43] Backup process was cancelled!
[2016-10-29 22:58:43] Notifying ‘ltd’ of the end of the backup…

The directory /var/www/discourse/tmp/backups/default/2016-10-29-225843 exists.

In the parent directory, I found 5 empty directories. I guess 4 were created manually and 1 automatically.

root@daynhauhoc-app:/var/www/discourse/tmp/backups/default# ls -lrth
total 20K
drwxr-xr-x 2 discourse www-data 4.0K Oct 29 07:17 2016-10-29-071717
drwxr-xr-x 2 discourse www-data 4.0K Oct 29 08:00 2016-10-29-080055
drwxr-xr-x 2 discourse www-data 4.0K Oct 29 08:14 2016-10-29-081441
drwxr-xr-x 2 discourse www-data 4.0K Oct 29 21:08 2016-10-29-210807
drwxr-xr-x 2 discourse www-data 4.0K Oct 29 22:58 2016-10-29-225843

I’m using 1.7.0.beta6, at the latest commit as of 28 Oct 2016:

https://github.com/discourse/discourse/commit/4d58a00387f6c911b3d35c126289552f93044acc

I did a successful manual backup right before this issue started.
Before that, automatic backups worked fine, but not anymore.

I have no idea how to debug this issue.

Thanks

I’m seeing some weird behavior too. The first time I launch a backup, nothing happens, and when I cancel, it is at “Notifying *** of the end of the backup”.

But on another up-to-date Discourse (I’m on this commit), I didn’t have any issues during the backup (except that I never received the notification).

I don’t think plugins can impact backups, but here are mine:

          - git clone https://github.com/discourse/docker_manager.git
          - git clone https://github.com/iunctis/iunctis-toolbar.git
          - git clone https://github.com/discourse/discourse-spoiler-alert.git
          - git clone https://github.com/iunctis/vb_emoji.git
          - git clone https://github.com/iunctis/discourse-affiliate.git

But I had no issue during my backup last night and I didn’t do any upgrade since then. Weird.

I can confirm manual backups are broken, and I think this only affects non-English boards. What’s the language of your boards?

@trandatnh - I have been having the same problem since updating to 1.7.0-beta6 (first upgrade since pre-beta). As you have observed, nothing in any of the logs clearly indicates what may be failing. I do notice that after the auto-backup starts, it appears to never complete. Cancelling the backup doesn’t really seem to do anything, as attempting a manual backup just says “An operation is currently running. Can’t start a new job right now.”

@Falco - Our forum is English.

French for me.

I’ll try in English tonight and let you know.

@Falco It’s an English board, if you mean the default language, although we discuss in Vietnamese.

This morning it got worse: the automatic backup started and failed to complete, which forced the site into read-only mode. Cancelling the backup from the admin panel doesn’t work.

I’m running ./launcher rebuild app and hope it will stop the backup for now.

It worked, and I disabled scheduled backups :’( It’s bad to have no daily backup.

Exactly, I had the same problem this morning, as I mentioned above.

Confirming, I can reproduce this failure (actually I’m struggling with backups too, see here). Generally speaking, I find it quite questionable that the tests-passed branch is the default one in the yaml container file…

Same issue here. Since updating to “latest-release +95 1.7.0.beta6” there seems to be a stuck or broken backup process.

I first noticed it a day after the update when I received a notice about high CPU load on our forum VPS. Tried a reboot, didn’t fix. The top command lists a Ruby process that’s constantly consuming around 100% CPU.

Then I tried “cd /var/discourse; git pull; ./launcher rebuild app; ./launcher cleanup”. It initially failed because the database couldn’t be shut down. A second attempt did work. The site was accessible again and no data was lost.

Then I tried to perform a manual backup. Normally I’d see a list of log messages but now all I see is a spinning wheel. Clicking Cancel gets the site out of ReadOnly mode for a few seconds and then it goes back to ReadOnly.

I’m going to do another rebuild and switch off automatic backups, and see if that a) gets the site permanently into read-write mode, and b) keeps CPU load at a normal level. And then pray that a) the forum doesn’t crash, and b) the bug gets fixed soon.

BTW, is there any way to roll back to a previous version, before 1.7.0.beta6?

We’re aware of the issues regarding backups and are actively working on fixing them. We have identified the problem but not the cause yet. It’s high on our list and will be fixed by the end of the week (hopefully sooner).

Mine is English, but I’m facing the same problem.

Okay I’ve spent over a day on this now. No solution but have narrowed it down and I think @sam needs to look at it. I believe the bug is in mini_racer, which seems to be crashing randomly during transpilation.

To reproduce the bug in development mode:

  1. rm -rf tmp
  2. redis-cli flushall
  3. Create a backup in /admin/backups

It should crash on “Notifying ‘eviltrout’ of the end of the backup…”

The process will be using 100% CPU, and you need to kill it before testing again.

Notes:

  • The backup takes place in a fork from unicorn, which is itself forked. I think this is important, as running it from a rails console does not create the same issue. If you recall, we were able to crash Discourse altogether when we were fiddling with PrettyText warming up before forking. I think mini_racer is a little delicate when being forked in our app.

  • The file it crashes while transpiling changes. Sometimes it’s the first file, sometimes it’s the fourth file, etc.

  • Because some files will succeed, if you don’t rm -rf tmp the site will eventually start working as it will have cached all the files it needs to transpile. This is why it took forever to debug, because it would eventually fix itself!

  • mini_racer is supposed to have a 15s timeout on eval, but even if you wait 15s it never continues.
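Since the stuck process never hits its own timeout and has to be killed by hand between test runs, an external watchdog can help while reproducing this. Here is a rough sketch, assuming the work runs in a forked child; `run_with_watchdog` is a hypothetical helper (not part of Discourse or mini_racer) and the block stands in for the backup job.

```ruby
# Runs the given block in a forked child and waits up to `timeout` seconds.
# Returns :finished if the child exited on its own, or :killed if the
# deadline passed and the child was terminated.
# (`run_with_watchdog` is a hypothetical helper for illustration only.)
def run_with_watchdog(timeout:, poll: 0.05)
  pid = fork do
    yield
    exit!(0) # skip at_exit handlers inherited from the parent
  end

  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    # With WNOHANG, wait returns nil immediately if the child is still running.
    return :finished if Process.wait(pid, Process::WNOHANG)
    break if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep poll
  end

  # The in-process timeout cannot be trusted here, so kill from the outside.
  Process.kill("KILL", pid)
  Process.wait(pid) # reap the child so it does not linger as a zombie
  :killed
end
```

For example, `run_with_watchdog(timeout: 30) { some_backup_job }` (with `some_backup_job` being whatever you are testing) would return `:killed` instead of leaving a process spinning at 100% CPU.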

Should be fixed per:

https://github.com/discourse/discourse/commit/7e43e73df69a5ca70e7f4546465525c7392612fb

After forking, we correctly reset the V8 context on PrettyText, but the transpiler and the JS locale helper still held the V8 context from the parent.

Since V8 is not fork safe (and probably never will be), we must clear all our V8 contexts after forking.
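To illustrate the general "rebuild after fork" idea, here is a minimal sketch that uses a plain Ruby object as a stand-in for a V8 context, so it runs without mini_racer. `FakeContext`, `ForkSafeCache` and `demo_fork` are hypothetical names for illustration, not Discourse or MiniRacer APIs, and this only shows the bookkeeping, not the actual fix in the commit above.

```ruby
# Stand-in for an expensive, non-fork-safe resource such as a V8 context.
# (Hypothetical class; it only remembers which process created it.)
class FakeContext
  attr_reader :owner_pid

  def initialize
    @owner_pid = Process.pid
  end
end

# Memoizes one context per process. If the current PID differs from the PID
# that built the cached context, we are running in a forked child, so the
# inherited context is discarded and a fresh one is built.
class ForkSafeCache
  def initialize(&builder)
    @builder = builder
    @context = nil
    @pid = nil
  end

  def context
    if @context.nil? || @pid != Process.pid
      @context = @builder.call
      @pid = Process.pid
    end
    @context
  end
end

CACHE = ForkSafeCache.new { FakeContext.new }

# Builds the context in the parent, forks, and asks the child which PID owns
# the context it sees. Returns
# [parent_pid, parent_context_owner, child_pid, child_context_owner].
def demo_fork
  parent_owner = CACHE.context.owner_pid
  reader, writer = IO.pipe
  child_pid = fork do
    reader.close
    writer.puts CACHE.context.owner_pid
    writer.close
    exit!(0)
  end
  writer.close
  child_owner = reader.read.to_i
  reader.close
  Process.wait(child_pid)
  [Process.pid, parent_owner, child_pid, child_owner]
end
```

The point is that every cache holding such a context needs this kind of check; missing even one of them leaves a child using state inherited from the parent.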

Long term, we should probably extend MiniRacer to allow it to “manually” free up all V8 contexts prior to forking and call a custom fork command, because ideally this cleanup happens before the fork. Also, Ruby really should give us a hook that we can call prior to forking.

Sadly, this has been on the back burner for so so long:

https://bugs.ruby-lang.org/issues/5446

Sadly, this is not fully sorted out … which is very odd. I will continue to debug this.

Update: Sam pushed another fix that does seem to fix this problem. I confirmed it is working this morning.

Yes, I just tried and it’s working perfectly 🙂 Thank you!
