Discourse crashed after backup - http 500

(Jay Pfaffman) #21

You can do a

./launcher cleanup

to reclaim the Docker space, but I don’t think that’ll solve your immediate problem.

I don’t know what your problem could be.

(Rich) #22

Ok, thanks @pfaffman, I’ll perhaps start a new thread with more details :+1:

(Matt Palmer) #23

I have no idea. It appeared to be accidentally fixed in 17.12, but since it was only an accidental fix, it could very well be back in 18.01. I suggest you try the repro steps I put in the bug report to verify for yourself.

One thing to note is that the problem is actually in the “shim” process between dockerd and the container; depending on your system setup and the upgrade approach taken, the old version of the shim may still be running. It’s best to completely manually remove the running container (docker rm -f app), stop the docker service (service docker stop), check that everything is truly dead (ps ax | grep docker), and then start docker (service docker start) and the app container (./launcher start app) again. That’ll make sure you’re definitely running the latest version of everything.
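
The sequence above can be sketched as a small script. This is only an illustration, not an official tool: the `run` and `clean_restart` helpers and the `DRY_RUN` variable are made up for this sketch, and it assumes you run it from the directory that contains `launcher`. By default it only prints the commands; set `DRY_RUN=0` to actually execute them.

```shell
#!/bin/sh
# Sketch only: the clean-restart sequence described in the post above.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
# Assumption: the current directory is the one containing ./launcher.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: the five steps, in the order given in the post.
clean_restart() {
    run docker rm -f app        # remove the running container
    run service docker stop     # stop the Docker service
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ ps ax | grep docker"
    elif ps ax | grep '[d]ocker'; then
        # '[d]ocker' keeps grep from matching its own process; any match
        # means something (possibly an old shim) is still alive.
        echo "docker processes still running; stop them first" >&2
        return 1
    fi
    run service docker start    # start Docker again
    run ./launcher start app    # bring the app container back up
}

clean_restart
```

Stopping at the `ps` check when anything is still alive is a choice made for this sketch; the point of the step is simply to confirm that no old shim survives before Docker is started again.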

(Jeff Atwood) #24

Note that we changed the backup process on our end not to spew out ginormous backup log traces, so as long as you are on Discourse latest you might not run into it regardless of whether it was fixed on Docker’s end or not.

(Rich) #25

Hi everyone, just to update and confirm.

As mentioned, I apt-get upgraded Docker to v18.01 and sure enough, from that moment on, I was having the exact same issues with my backups (even though I’m running the very latest Discourse version (v2.0.0.beta1 +123)).

So as planned, I rolled my Docker back from 18.01 down to 17.10 and sure enough, instant cure.

Also, I noticed the issue @mpalmer opened on GitHub is still open, so while they may have fixed it in a later 17.x release, it appears to still be very present in 18.01.

Thanks for your help everyone, my Discourse is now running like a dream again :heart_eyes:

(Matt Palmer) #26

Whatever Docker bug has hit you, I don’t think it’s the one I previously reported, for two reasons:

  1. I can’t repro my bug on Docker 18.01 using the recipe in the bug report, whereas it was a 100% reliable repro on 17.11. EDIT: Turns out I can repro, I just need to have a lot more patience (it takes several minutes to assplode now, rather than a few seconds).
  2. We mitigated the proximate trigger of the bug within Discourse itself before the 1.9 release, by sending postgres logs to files rather than Docker.

So whilst it’s good to know that 18.01 appears to have problems, they are almost certainly different problems, which will need a separate round of diagnosis and replication before they can be reported to Docker to be fixed.

EDIT: I haven’t yet been able to reproduce a Discourse hang due to backups on Docker 18.01, so if anyone’s got a hanging system they’re happy to leave “dead” for a period of hours whilst I dig into what’s going wrong, please PM me. Further posts without significant technical details (like strace output and Go stacktraces, or a reliable reproduction recipe on a bare system) are unlikely to be helpful, so feel free to leave the “Reply” button unmolested if you just want to say “this is happening to me too!”.

(Sam Saffron) #27

Note: before handing Matt a system, please do a ./launcher rebuild app so you pick up all the fixes we applied. We are looking for a “crashed” system that is running both the latest code and the latest container / base image / bootstrap.