I have been having some trouble since last night with my Discourse setup. I would appreciate it if someone could help me troubleshoot.
Timeline:
Users report they can’t access the site, and I can reproduce it 100% of the time. When I log into my DO droplet (Ubuntu 16.04 LTS x64) I can see the OS is asking for a reboot (this has never happened before). I rebooted and, at least on my PC, regular service was back.
As I was already offline I took the opportunity to upgrade Discourse. I rebuilt to the latest version (2.3.0 beta2) and everything seemed to work again (tested with Safari on Mac).
I noticed that docker-engine was deprecated, so I uninstalled it and installed docker-ce. Everything was working fine.
Hours later users report issues which I can’t reproduce, until I start trying different browser, OS and connection combinations:
Works on FF + Win
Does not work on Chrome + Win
Works on iOS + Safari + Wi-Fi
Does not work on iOS + Safari + 4G
All very weird, as you can see.
I see that all the logos are gone, which I notice after seeing errors in the log file:
I can see that this is a known issue, so I proceed to re-upload the logos and everything seems to go nearly back to normal.
Now Chrome + Win works, but none of the others do. IE returns a 504, which some users see as well. In the combinations that do work, the site loads as quickly as ever.
Another weird problem I’ve noticed is Firefox complaining about the certificate (Let’s Encrypt) while Chrome is fine with it.
EDIT: The certificate seems to be fine. For some reason FF was reporting that I had added an exception, which I have no memory of doing. Once I removed it, the green padlock was back…
I know this is loose and open-ended, but where would you advise that I start? I would say the 504 problem is the most concerning of all, as I suspect it explains the access problems.
I don’t have access to the console to check disk space and RAM right now, unfortunately, but I don’t think that is the problem, because the behaviour is consistent. For a given combination of user and environment, the site either works all the time or doesn’t work at all. For some reason it seems to be related to the connection (once I re-uploaded the logos, that is).
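For when console access is available again, the disk space and RAM check mentioned above could be scripted roughly like this. It is only a minimal sketch, assuming a Linux host with `/proc/meminfo` (which an Ubuntu droplet should have); the function names are my own and are not part of Discourse or Docker.

```python
import shutil


def disk_free_fraction(path="/"):
    """Return the fraction of free space on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only "Name: <number> [unit]" lines; skip anything malformed.
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def available_memory_mb(meminfo_path="/proc/meminfo"):
    """Return MemAvailable in MB, or None if the field is missing (Linux only)."""
    with open(meminfo_path) as handle:
        info = parse_meminfo(handle.read())
    kb = info.get("MemAvailable")
    return None if kb is None else kb / 1024
```

Calling `disk_free_fraction("/")` and `available_memory_mb()` on the droplet would give a quick first read on whether the host is starved for disk or memory, before digging into the 504s.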
Will do that. Do you reckon there is any chance of a problem with nginx? How should I go about checking that? Might it be caching responses indefinitely, so that users who tried to access the site while it was down keep seeing it as down?
OK, thanks. I have filed an issue with DigitalOcean as well, just in case there is some relationship. As a matter of fact, they made some network changes yesterday. I would be surprised if they broke something and it went unnoticed for this long, but who knows. As I said, what makes me so suspicious is that the problem seems to be related to the user’s connection.
So as it turns out, our domain registration had expired. Embarrassing, I know. Fortunately we have been able to rescue the situation. Apologies for wasting your time, and thank you for your help.
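For anyone who lands here with similar symptoms: an expired registration would plausibly explain why failures followed the connection rather than the browser, since some resolvers may still have had the old records cached while others no longer resolved the name. That is my assumption, not something I confirmed. A minimal sketch of a check that can be run from each affected network follows; the helper name is hypothetical, and `forum.example.com` in the note below is a placeholder for the real hostname.

```python
import socket


def resolve_host(hostname):
    """Return the sorted, de-duplicated IP addresses the local resolver gives for hostname.

    An empty list means name resolution failed on this machine and network.
    """
    try:
        results = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Resolution failed, e.g. the name no longer exists for this resolver.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({entry[4][0] for entry in results})
```

Running `resolve_host("forum.example.com")` on Wi-Fi and again on 4G, and comparing the two results, would show whether the two connections disagree about where the site lives (or whether it exists at all). That points at DNS rather than at nginx or Discourse.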