We’ve been having some upload issues on our Discourse instance recently: most of the time, an image upload never finishes. Occasionally it does, but only after a lengthy wait.
After some investigation, it seems to come down to a problem with MessageBus. If I subscribe to the /uploads/composer channel on the server, I get a message containing a working url almost as soon as I’ve clicked upload, but that message usually never reaches the client.
This doesn’t just affect uploads, but seemingly all MessageBus channels. Post edits won’t appear before a refresh, and if I create a test channel, subscribe to it on the client, and push messages to it from the server, they never arrive.
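For anyone following along, this is my understanding of what a message-bus poll looks like on the wire (a stdlib-only sketch; the endpoint shape and field names follow the message_bus protocol as I understand it, and the response body is a made-up example of a *healthy* reply — in our case the POST succeeds but nothing ever comes back):

```ruby
require "json"
require "uri"

# The client POSTs its last-seen message id per channel to
# /message-bus/<clientId>/poll; -1 means "I have no position yet".
poll_body = URI.encode_www_form("/uploads/composer" => -1, "/test" => -1)

# A healthy server answers with a JSON array of messages, each carrying the
# channel position (message_id) the client should use on its next poll:
response = <<~JSON
  [{"global_id": 42, "message_id": 7, "channel": "/test", "data": {"hello": "world"}}]
JSON

messages = JSON.parse(response)
next_position = messages.last["message_id"]  # next poll sends "/test" => 7
```

In our broken case the response array stays empty on every poll, so the client never advances its position and never sees the upload message.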
The one scenario where messages do arrive (more often than not) is when I have two Discourse tabs open: when I switch to the second tab, the message arrives in the first.
This is definitely a misconfiguration on our side at discourse.mozilla.org - uploads are working fine here - but I’m not sure what it could be. We only put static assets behind a CDN, and dynamic content goes through ELB and HAProxy - but none of that has changed since we started seeing the issue.
Anyone got an idea what could be going on here? I’m pretty stumped…
Everything under /admin/site_settings/category/developer is set to default, apart from enable long polling. That was set to false back in 2016 (well before we started seeing this issue), so as expected enabling it didn’t make any difference.
So, long polling base url is the default of /.
No, we’re using the default configuration.
Bizarre… long polling interval is set to the default of 25000.
Not that I have much experience in this specific problem, but some thoughts…
Depending on your AWS ELB setup you might be falling into the “Full site CDN acceleration” category.
This depends on settings at many levels before you get to Discourse; whether you’re on an ALB, NLB, or Classic Load Balancer might make a difference.
This looks like a bug in message bus short polling to me: you are subscribing on -1 but not getting the position in the channel like you should. You really should fix long polling, because it’s really inefficient the way you have it going, but regardless, bugs in short polling are no good, so we should fix them.
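To illustrate what *should* happen with a -1 subscription, here’s a toy model of the position bookkeeping (not the real message_bus internals, just the contract the client relies on):

```ruby
# Toy model: a channel keeps a backlog of messages with increasing ids.
class ToyChannel
  def initialize
    @backlog = []   # [[id, data], ...]
    @last_id = 0
  end

  def publish(data)
    @last_id += 1
    @backlog << [@last_id, data]
    @last_id
  end

  # A poll at -1 means "I have no position yet": the correct response carries
  # no backlog but tells the client the channel's current position, so the
  # next poll can ask for everything after that point. The bug described
  # above is the -1 case failing to hand back that position.
  def poll(position)
    return { messages: [], position: @last_id } if position == -1

    { messages: @backlog.select { |id, _| id > position }, position: @last_id }
  end
end

channel = ToyChannel.new
channel.publish("a")
channel.publish("b")

first = channel.poll(-1)                 # new client: learns position 2, no backlog
channel.publish("c")
second = channel.poll(first[:position])  # resumes from 2: receives only "c"
```

If the -1 poll never reports the position, the client can’t resume correctly and messages published in the meantime are silently missed.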
Added to my list and will look at adding some tests for this next week.
You can set an idle timeout value for both Application Load Balancers and Classic Load Balancers. The default value is 60 seconds. With an Application Load Balancer, the idle timeout value applies only to front-end connections. With a Classic Load Balancer, if a front-end connection or a back-end connection is idle for longer than the idle timeout value, the connection is torn down and the client receives an error response. A registered target can use a keep-alive timeout to keep a back-end connection open until it is ready to tear it down.
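Since your dynamic traffic goes ELB → HAProxy, one thing worth checking is that the timeouts line up: message_bus holds a long poll open for roughly 25-30 seconds, the ELB’s idle timeout defaults to 60 seconds, and HAProxy’s own timeouts sit behind that. Illustrative HAProxy values (not your config — the point is the ordering, with each layer’s timeout outliving the one in front of it):

```
# haproxy.cfg — illustrative values only; aim for:
#   long-poll hold (~25-30s) < ELB idle timeout (60s) < HAProxy timeouts
defaults
  mode http
  timeout connect          5s
  timeout client           75s   # outlives the ELB's 60s idle timeout
  timeout server           75s   # must exceed the time a long poll is held open
  timeout http-keep-alive  75s   # back-end keep-alive outlives the ELB's idle timeout
```

If HAProxy (as the registered target) tears down an idle keep-alive connection before the ELB does, the ELB can end up writing a poll request into a dead connection, which would fit the "messages only arrive when something else pokes the connection" symptom.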
Application Load Balancers and Classic Load Balancers support pipelined HTTP on front-end connections. They do not support pipelined HTTP on back-end connections.