MessageBus short polling is not working


(Leo McArdle) #1

We’ve been having some upload issues on our Discourse instance recently: most of the time, if you upload an image it will never finish. Occasionally it will, but only after a lengthy wait.

After doing some investigation, it seems to be down to a problem with MessageBus. If I subscribe to the /uploads/composer channel on the server, I’ll get a message containing a working url almost as soon as I’ve clicked upload. The message will usually never get to the client.

This doesn’t just affect uploads, but seemingly all MessageBus channels. Post edits won’t appear before a refresh, and if I subscribe to a test channel I create on the client and push messages to it from the server they won’t arrive.

The one scenario where messages will arrive (more often than not) is if I have two discourse tabs open - when I switch to the second tab, the message will arrive in the first tab.

This is definitely a misconfiguration we have on discourse.mozilla.org - as uploads are working fine here - but I’m not sure what it could be. We only put static assets behind a CDN, and dynamic content is behind ELB and HAProxy - but we’ve changed none of that since we’ve started seeing issues.

Anyone got an idea what could be going on here? I’m pretty stumped… :deciduous_tree:


(Dean Taylor) #2

I noticed your long polling events are completing almost immediately (~140ms).

Compared with ~25s here at meta:


(Felix Freiberger) #3

What’s the value of the site setting long polling base url?
Did you configure anything special in ELB and HAProxy regarding Message Bus / long polling?


(Leo McArdle) #4

Everything under /admin/site_settings/category/developer is set to default, apart from enable long polling. That was set to false back in 2016 (well before we started seeing this issue), so as expected enabling it didn’t make any difference.

So, long polling base url is the default of /.

No, we’re using the default configuration.

Bizarre… long polling interval is set to the default of 25000.


(Dean Taylor) #5

Not that I have much experience in this specific problem, but some thoughts…

Depending on your AWS ELB setup you might be falling into the “Full site CDN acceleration” category.
This depends on settings at many levels before a request reaches Discourse; whether you are using an ALB, NLB or Classic Load Balancer might also make a difference.

Specifically, I’d point you at rule number one:


(Sam Saffron) #6

This looks like a bug in message bus short polling to me: you are subscribing on -1 but not getting the position in the channel like you should. You really should fix long polling, because the way you have it going is really inefficient. Regardless, bugs in short polling are no good, so we should fix them.

Added to my list and will look at adding some tests for this next week.
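To make the failure mode concrete, here is a toy model of it in Python. It is only a sketch under stated assumptions, not MessageBus code: the `Channel` class, the `report_position` flag and the message payloads are all hypothetical. The one thing taken from the thread is the idea that a client subscribing at -1 has to be told its position in the channel, otherwise every later short poll asks for -1 again and anything published between polls is lost.

```python
class Channel:
    """Toy in-memory backlog (hypothetical). Message ids start at 1."""

    def __init__(self):
        self.backlog = []  # list of (id, payload) tuples

    def publish(self, payload):
        msg_id = len(self.backlog) + 1
        self.backlog.append((msg_id, payload))
        return msg_id

    def poll(self, since_id, report_position):
        """Answer one short poll.

        since_id == -1 means "only messages from now on", so no messages
        are returned for it. A well-behaved server also reports the
        current position so the client can ask for newer ids next time.
        """
        if since_id == -1:
            position = len(self.backlog) if report_position else None
            return [], position
        return [m for m in self.backlog if m[0] > since_id], None


def run_client(report_position):
    """Short-poll a channel while messages are published between polls."""
    channel = Channel()
    channel.publish("old message")  # exists before the client subscribes
    since_id = -1
    received = []

    def short_poll():
        nonlocal since_id
        messages, position = channel.poll(since_id, report_position)
        if position is not None:
            since_id = position  # client learns where it is in the channel
        for msg_id, body in messages:
            received.append(body)
            since_id = msg_id

    for payload in ("upload finished", "post edited"):
        short_poll()
        channel.publish(payload)  # arrives between two polls
    short_poll()
    return received


print(run_client(report_position=True))
print(run_client(report_position=False))
```

With the position reported, the client receives both messages published after it subscribed. Without it, the client stays at -1 forever and receives nothing, which matches the symptom of uploads and edits never reaching the browser.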


(Leo McArdle) #7

I don’t think we are. @yousef assures me that every request to discourse.mozilla.org will always hit the docker container. Neither ELB nor HAProxy will ever fetch from a cache instead.

Totally up for that, but I’m not sure I even understand how long polling is broken. What would cause Discourse to use short polling over long polling, like it is on discourse.mozilla.org?


(Sam Saffron) #8

enable long polling being set to false, perhaps? Or some proxy running HTTP 1.0? I really don’t know; I’d need to know the exact design you have and all the site settings.


(Dean Taylor) #9

Well, what I mean by that is there are settings you might need to configure / might not have control over.

For example there is the lovely 60 second default timeout for ELB:

http://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/how-elastic-load-balancing-works.html#http-connections

You can set an idle timeout value for both Application Load Balancers and Classic Load Balancers. The default value is 60 seconds. With an Application Load Balancer, the idle timeout value applies only to front-end connections. With a Classic Load Balancer, if a front-end connection or a back-end connection is idle for longer than the idle timeout value, the connection is torn down and the client receives an error response. A registered target can use a keep-alive timeout to keep a back-end connection open until it is ready to tear it down.

Application Load Balancers and Classic Load Balancers support pipelined HTTP on front-end connections. They do not support pipelined HTTP on back-end connections.
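As a rough sanity check, the long polling interval can be compared against the idle timeout of every hop in front of Discourse. The sketch below assumes the 25000 ms long polling interval and the 60 second ELB default mentioned in this thread; the HAProxy timeout, the safety margin and the function names are made up for illustration and would need to be replaced with real values.

```python
def tightest_idle_timeout(hops):
    """Return (name, seconds) for the hop with the smallest idle timeout."""
    return min(hops.items(), key=lambda item: item[1])


def long_poll_fits(long_poll_interval_ms, hops, margin_s=5.0):
    """Check that a long poll can finish before any hop drops the connection.

    margin_s is an arbitrary allowance for network and processing delay.
    Returns (fits, name_of_tightest_hop).
    """
    name, timeout_s = tightest_idle_timeout(hops)
    fits = long_poll_interval_ms / 1000.0 + margin_s <= timeout_s
    return fits, name


# ELB default comes from the AWS documentation quoted above;
# the HAProxy figure is a placeholder, not a known value for this site.
hops = {"elb": 60.0, "haproxy": 50.0}
print(long_poll_fits(25000, hops))
```

If this check passes for the real values, the idle timeouts alone would not explain requests returning after ~140ms, and something else in the chain is more likely to blame.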


(Graham Perrin) #10

@LeoMcA and all: thank you.


Cross references:

https://matrix.to/#/!EIfzkjZqlOZlFzvuWj:matrix.org/$15042507233254388CnwUE:matrix.org (2017-09-01):

Yesterday and today I have sporadic trouble with Firefox drag and drop of images to Discourse. Typically apparently stuck at:

Uploading 100%

– and:


(Leo McArdle) #11

Long polling seems to have magically started working again:

…and with it uploads and edits. ¯\_(ツ)_/¯


(Sam Saffron) #12

For what it is worth I just fixed up the issues in message bus per:

Great catch … moving this to #bug and closing!


Uploading pictures stuck at 100%
(Sam Saffron) #13