UI randomly fails for short periods

Hi there!

I installed my own ‘stable’ Discourse with external Postgres and Redis.
To be precise about the architecture: in Azure, 1 load balancer, 1 VM hosting the Discourse container with an NFS share for backups and pictures, 1 Postgres, 1 Redis.

I customized it with my own logo and the discourse-calendar and discourse-news plugins (and other things too, but they are irrelevant here).

Randomly, for a period of about 30 minutes, parts of the UI fail:

  • Main logo reverts to the default one
  • Favicon reverts to the default one
  • The “upcoming-events” page generated by discourse-calendar disappears (no link, and a 404 response when going to it by URL)
  • The custom logo given to discourse-news (with a URL) disappears

Then it comes back.

I have nothing in the logs about that.
My browser console shows nothing.
One thing I can tell is that during this period, I can see an increase in Redis cache misses.

Can anybody help me troubleshoot this? I do not even know where I can find relevant logs…
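
One way to put numbers on the cache-miss observation is to take two snapshots of `redis-cli INFO stats` (one before and one during an incident) and compare the counters. The sketch below is only an illustration, not something from this thread: the function names are made up, and it assumes the snapshots were saved as plain text. It relies on the `keyspace_hits`, `keyspace_misses` and `evicted_keys` fields that Redis reports in its stats section.

```python
def parse_redis_info(text):
    """Parse 'key:value' lines from saved `redis-cli INFO` output into a dict."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Stats".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            stats[key] = value
    return stats


def miss_ratio(before, after):
    """Share of lookups that missed between two snapshots (0.0 if no lookups)."""
    hits = int(after["keyspace_hits"]) - int(before["keyspace_hits"])
    misses = int(after["keyspace_misses"]) - int(before["keyspace_misses"])
    total = hits + misses
    return misses / total if total else 0.0


def evicted_between(before, after):
    """Number of keys Redis evicted between two snapshots."""
    return int(after.get("evicted_keys", 0)) - int(before.get("evicted_keys", 0))
```

If `evicted_between` is greater than zero during the bad period, memory pressure on the Redis instance would be one thing to look at; if it stays at zero, the extra misses probably have another cause.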

I’m afraid this is too far from the standard install for me to know anything about. :slight_smile:

Have you managed to find what you needed?


No, I did not.
And my problem continues to appear occasionally :disappointed_relieved:
I do not know where to look for a clue…

The easiest thing would be to switch to a standard install. It would be cheaper too. I can’t imagine what it could be.


@pfaffman From my point of view, I did use the standard installation.
Except that I used the provided feature to use an external DB and Redis.
But I use the app.yaml and the docker build and run described in the Standard install.

I did that to be able to provide high availability and different scaling strategies: with a full standalone deployment, you can only scale vertically (scale up your node), and it is not highly available.

I see. That does sound like it should work. My best guess is that you’re scaling down to zero virtual machines and what you see is the cached site in your browser. Or there’s some other way the load balancer isn’t connecting to the host. Or Discourse isn’t getting the real IP and is rate limiting (but usually you would see an error).

But your high availability features are providing low availability. Unless you’re going from having tens of users most of the time to thousands at other times (as for a sports site), scaling is likely to cause more problems than it solves.

So the first thing I’d do is get rid of the load balancer and see if that fixes it. Then decide what to do from there. If it happens once a month it won’t be easy to diagnose.
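
Since the failures are rare, it may help to timestamp them automatically while testing with and without the load balancer. The following is a rough sketch under stated assumptions: both URLs are hypothetical placeholders, the VM is assumed to be reachable directly, and `fetch_status` and `probe` are illustrative names. It only records the HTTP status of the “upcoming-events” page through each path, so a 404 seen through one path and not the other would point at the balancer.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; keep the code (e.g. 404).
        return err.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        return None


def probe(urls, fetch=fetch_status):
    """Check every named URL once; return (timestamp, name, status) rows."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return [(now, name, fetch(url)) for name, url in urls.items()]


# Example (hypothetical addresses, run from cron or a loop every minute):
# urls = {
#     "via_lb": "https://forum.example.com/upcoming-events",
#     "direct": "http://10.0.0.4/upcoming-events",
# }
# for row in probe(urls):
#     print(*row)
```

The timestamps can then be lined up with the Redis metrics and the load balancer’s own health-probe logs for the same period.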