Forum is down, upgrade failed with fatal error, need help


Hey guys, I am panicking. A little. I just tried to upgrade our forums. I was then asked to rebuild the app, which I did with:

 ./launcher rebuild app

That process fails with the error message below, and now the forum won’t come back up. Ubuntu 16.04, everything up to date, including docker-ce:

Pups::ExecError: cd /var/www/discourse/plugins && git clone failed with return #<Process::Status: pid 294 exit 128>
Location of failure: /pups/lib/pups/exec_command.rb:108:in `spawn'
exec failed with the params {"cd"=>"$home/plugins", "cmd"=>["mkdir -p plugins", "git clone", "git clone", "git clone", "git clone", "git clone", "git clone", "git clone", "git clone"]}
** FAILED TO BOOTSTRAP ** please scroll up and look for earlier error messages, there may be more than one

(Chris Beach) #2

I know how it feels to be in this situation :cold_sweat:

Unsure if this is the issue but the Discourse Presence plugin has been brought into Discourse core now.

You probably want to remove this plugin from your containers/app.yml


That did the trick. Thank you so much. Panic level reset to 0. :slight_smile:

However, I think something like this should never happen. Ever. There should be sanity checks in the upgrade scripts that remove plugins that are no longer compatible (or have been merged into core). IMHO of course.

(Chris Beach) #4

I’d be the first to agree with that statement.

Each of our forums is different. We all have different combinations of plugins, different data etc. I think some basic tests should be run automatically during the upgrade process, and the upgrade aborted if they fail.

A few simple automated tests, like “does the forum start up;” “can a user signup;” “can a post be added” - followed by the acceptance test suites of each of the installed plugins.
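As a rough illustration of the kind of smoke tests described above, here is a minimal Python sketch. The function name `run_smoke_checks` and the idea of passing checks as plain callables are assumptions made for this example; nothing like this exists in the Discourse launcher as far as this thread shows.

```python
from typing import Callable, Dict, List


def run_smoke_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each named check and return the names of the ones that failed.

    A check counts as failed if it returns a falsy value or raises.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A crashing check is treated the same as a failing one.
            ok = False
        if not ok:
            failed.append(name)
    return failed
```

An upgrade script could abort whenever the returned list is non-empty. The real checks ("does the forum start up", "can a user sign up") would have to be written against the running site and are left out here.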

Third party plugins aren’t the responsibility of the Discourse core team, but there are surely some simple, general-purpose safeguards that can be added to the upgrade process with no expectation of further ongoing work from the core team, or the provision of any plugin-specific logic.

(cpradio) #5

Tests could not solve this problem. The only way this could have been solved is if the original discourse-presence plugin location had emptied its repo (the Poll plugin did that, I think). However, doing that instantly breaks people who are on stable or beta versions of Discourse when they do a rebuild/upgrade.

(Chris Beach) #6

We’re using Docker. I may be wrong, but couldn’t tests be run within a newly-built Docker container while the old one is still up and serving requests to visitors? Then, if tests pass, switch over.

It’s a non-trivial re-engineering, but the benefits are significant - a rock-solid upgrade process, and high availability.


But in this case “we” knew that a plugin was merged into core. Can’t we build some checks around cases like these? I mean, if I know that plugin XYZ was merged and I know it will break the install if the plugin is still present during upgrade, I could simply disable that plugin. No?
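A pre-upgrade check along those lines could look something like the Python sketch below. It scans the text of `containers/app.yml` for `git clone` lines and flags any plugin that appears in a list of plugins known to be merged into core. The list `MERGED_INTO_CORE` and the function name are hypothetical; only discourse-presence comes from this thread, and a real check would also have to take the installed branch (tests-passed, beta, stable) into account.

```python
import re

# Hypothetical list for illustration. Only discourse-presence is taken
# from this thread; a real list would have to be maintained somewhere.
MERGED_INTO_CORE = {"discourse-presence"}

# Matches active YAML list items such as "- git clone https://...".
# Lines commented out with "#" do not match.
CLONE_RE = re.compile(r"^\s*-\s*git clone\s+(\S+)", re.MULTILINE)


def find_obsolete_plugins(app_yml_text, merged=MERGED_INTO_CORE):
    """Return names of cloned plugins that appear in the merged list."""
    obsolete = []
    for url in CLONE_RE.findall(app_yml_text):
        # Take the last path segment of the URL and drop a trailing ".git".
        name = url.rstrip("/").rsplit("/", 1)[-1]
        if name.endswith(".git"):
            name = name[:-4]
        if name in merged:
            obsolete.append(name)
    return obsolete
```

This only reports the offending lines; removing them from `app.yml` would still be up to the admin.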

(Mittineague) #8

IMHO, the best way to do upgrades / modifications is to not do so cold on a live site. There should be a development / staging “clone” of the site where it can be tested before making the changes to the live site.

True, some bugs may escape notice, but protracted down time resulting from a major fail can be avoided.

(cpradio) #9

You are missing the point. If I ran tests in the old location, they would still pass. If I ran the same tests in the newly merged location, they too would pass. The issue wasn’t the code of the plugin; it was the installation of the plugin that failed.

Going with a multiple-container approach would permit zero downtime and would prevent a failed upgrade from taking down the site. Getting there as a default, however, is a fairly large task to take on.

(cpradio) #10

Again, not a simple problem. It was merged into core for those running on tests-passed. Those running on beta or stable do not see it. So it has to be a bit smarter than you are likely considering right this moment.

(Chris Beach) #11

Here’s the scenario:

  1. One environment - let’s call it “blue” - is up and running, serving requests to visitors. Another environment, “green,” is also up and running but not serving requests. Both are on the same discourse version.
  2. An upgrade is carried out on “green,” and then the smoke tests are run. Either the upgrade or the smoke tests fail, and the release is automatically cancelled, pending investigation. But no users are affected because they’re still using “blue.”
  3. Admin sees that “green” has failed, and diagnoses the problem.
  4. The upgrade is retried on “green.” It passes.
  5. Users are switched over to “green.”
  6. “blue” is upgraded.
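The scenario above can be sketched in a few lines of Python. This is only a model of the control flow, not a working deployment tool: `upgrade` and `smoke_test` are hypothetical callables standing in for whatever actually rebuilds a container and checks it, and the environment names are just labels.

```python
from typing import Callable


def blue_green_upgrade(upgrade: Callable[[str], bool],
                       smoke_test: Callable[[str], bool],
                       live: str = "blue",
                       idle: str = "green") -> str:
    """Upgrade the idle environment first and switch only if it is healthy.

    Returns the name of the environment that should serve users afterwards.
    """
    # Steps 2-3: upgrade and test the idle environment. On any failure the
    # release is cancelled and users stay on the untouched live environment.
    if not upgrade(idle) or not smoke_test(idle):
        return live

    # Step 5: the idle environment passed, so it becomes the live one.
    new_live, old_live = idle, live

    # Step 6: bring the former live environment up to date as well.
    upgrade(old_live)
    return new_live
```

Note that the upgrade itself is part of what gets tested here: a failure during the rebuild cancels the release just as a failed smoke test does.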

(cpradio) #12

That is the idea behind a multiple-container setup. You can technically do that right now, if you wish to go through the advanced setup and steps, but it isn’t a cake walk.

My point previously was that running tests alone wouldn’t have solved this problem. You would have had to actually perform the upgrade and see that the process itself failed.