Why does Discourse not use Web Sockets


(Sam Saffron) #1

(This is an old topic I wrote on Dev that belongs here)

I wrote this a year and a half ago; keep that in mind when reading.

Background

The “message bus” is a component that allows us to easily publish information to our clients and between the Rails processes in the farm. At its core it provides a few very simple APIs.

# publish a message to all subscribers on the particular channel
MessageBus.publish '/channel_name', data

# then on the server side you can
MessageBus.subscribe '/channel_name' do |msg|
  do_something msg.data
end

Or on the JS client we can:

Discourse.MessageBus.subscribe('/channel_name', function(data){ /* do stuff with data*/ });

These simple APIs hide a bunch of intricacies. For example, we can publish messages to a subset of users:

MessageBus.publish '/private_channel', 'secret', user_ids: [1, 2, 3]

The message bus “understands” what site a message belongs to, transparently. Our Rails apps have the ability to serve multiple web sites (e.g. ember and dev are served in the same process, with different databases).

What changed?

I always really liked the API, it is simple enough; however, the bus itself was inherently unreliable. When the server sent messages to the client there was no acking mechanism. If a server restarted there was no way for it to “catch up”.

To resolve this I created an abstraction I call ReliableMessageBus; at its core it allows you to catch up on any messages in a channel. This involves some fairly tricky Redis code; the idea is that when something is published on a Redis channel it is also stored in a list:

def publish(channel, data)
  redis = pub_redis
  offset_key = offset_key(channel)
  backlog_key = backlog_key(channel)

  # watch every key we read so the MULTI below is discarded if another publisher races us
  redis.watch(offset_key, backlog_key, global_id_key, global_backlog_key, global_offset_key) do
    offset = redis.get(offset_key).to_i
    backlog = redis.llen(backlog_key).to_i

    global_offset = redis.get(global_offset_key).to_i
    global_backlog = redis.llen(global_backlog_key).to_i

    global_id = redis.get(global_id_key).to_i
    global_id += 1

    too_big = backlog + 1 > @max_backlog_size
    global_too_big = global_backlog + 1 > @max_global_backlog_size

    # per-channel ids are monotonic: list length plus however many entries were trimmed away
    message_id = backlog + offset + 1
    redis.multi do
      if too_big
        # cap the per-channel backlog, remembering how many entries were dropped
        redis.ltrim backlog_key, (backlog + 1) - @max_backlog_size, -1
        offset += (backlog + 1) - @max_backlog_size
        redis.set(offset_key, offset)
      end

      if global_too_big
        # same cap for the global backlog
        redis.ltrim global_backlog_key, (global_backlog + 1) - @max_global_backlog_size, -1
        global_offset += (global_backlog + 1) - @max_global_backlog_size
        redis.set(global_offset_key, global_offset)
      end

      msg = MessageBus::Message.new global_id, message_id, channel, data
      payload = msg.encode

      # store the message in the backlog list *and* publish it, so late subscribers can catch up
      redis.set global_id_key, global_id
      redis.rpush backlog_key, payload
      redis.rpush global_backlog_key, message_id.to_s << "|" << channel
      redis.publish redis_channel_name, payload
    end

    return message_id
  end
end

The reliable message bus allows anyone to catch up on missed messages (it also caps the size of the backlog for sanity).

With these bits in place it was fairly straightforward to implement both polling and long polling, two things I had not implemented in the past. The key was that I had a clean way of catching up.
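
To make the catch-up idea concrete, here is a minimal sketch of how a subscriber could read missed messages back out of that Redis list. This is an illustration only, not the actual message_bus code: it borrows pub_redis, offset_key and backlog_key from the publish method above, and assumes last_id is the last message id the client has seen.

# Illustrative sketch only: read everything newer than last_id out of the
# per-channel backlog list written by publish above.
def backlog(channel, last_id = 0)
  redis = pub_redis
  offset = redis.get(offset_key(channel)).to_i

  # the message at list index i has id offset + i + 1, so start reading
  # at index last_id - offset (clamped to the head of the list)
  start_index = last_id - offset
  start_index = 0 if start_index < 0

  redis.lrange(backlog_key(channel), start_index, -1).map do |payload|
    MessageBus::Message.decode(payload) # assumed counterpart of msg.encode above
  end
end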

Why I hate web sockets (and disabled them)

My initial implementation was unreliable, but web sockets made it mostly sort of work. With web sockets you get this false sense that it is simple enough to just hook up a few callbacks on your socket and all is good. You don’t worry about backlogs, the socket is always up, and everything else is an edge case.

However, web sockets are just jam packed full of edge cases:

  • There are a ton of web socket implementations that you need to worry about: multiple framing protocols, multiple handshake protocols, and tons of weird bugs, like needing to hack stuff so HAProxy forgives some insane flavors of the protocol
  • Some networks (and mobile networks) decide to disable web sockets altogether, like Telstra in Australia
  • Proxies disable them
  • Getting SSL web sockets to work is a big pain, but if you want web sockets to work you must be on SSL.
  • Web sockets don’t magically catch up when you open your laptop lid after it has been closed for an hour.

So, if you decide to support web sockets you carry a large amount of code and configuration around, and you are forced to implement polling anyway, because you cannot guarantee that clients support web sockets and because networks may, and do, disable them.

My call is that all this hoop jumping, and the complex class of bugs that would follow it, is just not worth it. Given that nginx can handle 500k requests a second, our bottleneck is not the network. Our bottleneck for the message bus is Ruby and Redis; we just need to make sure those bits are super fast.

I really hate all the buzz around web sockets; so much of it seems to be about how cool web sockets are, and a lot less is based on evidence that sockets actually improve performance, or even network performance, in a way that significantly matters. Gmail is doing just fine without web sockets.

This makes it easier to deploy Discourse

Now that I have rid us of the hard web socket dependency and made “long polling” optional (in site settings), people can deploy Discourse on app servers like Passenger if they wish. It will not perform as well, and updates will not be as instant, but it will work.


I wrote that about a year ago, but it is still pretty much true today.


(Salman, Freelance Developer) #2

Hi Sam,

Interesting take on this issue, thanks for sharing. Did many of your opinions come from using web sockets at SO?

Seeing as Discourse is not using web sockets, is it simply polling every x seconds?

Have there been any benchmarks on what a DO 2GB server can handle in terms of currently logged-in users polling every x seconds?


(Sam Saffron) #3

I was not involved with the implementation directly at SO; Marc wrote a lot of it. But getting the framing protocols right there was a nightmare, and from day 1 SO had a fallback option.

No, that is not the case. We use long polling, which means a connection is held open for up to 29 seconds waiting for data, and the client is notified right away when data arrives. There are extra safeguards, like moving to slow polling when the tab is out of focus.
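
To illustrate the shape of that, here is a toy sketch of the long-poll pattern (not Discourse’s actual code; the 29-second figure is the one mentioned above, and the queue and thread wiring are made up for the example):

require 'timeout'

# Toy long poll: block until something is pushed onto the queue, or give up
# after hold_for seconds so the client can simply issue another poll.
def long_poll(queue, hold_for: 29)
  Timeout.timeout(hold_for) { queue.pop } # returns the moment data arrives
rescue Timeout::Error
  nil # nothing arrived in time
end

messages = Queue.new
Thread.new { sleep 3; messages << { post_id: 42 } } # simulate a publish

p long_poll(messages) # prints the message after roughly 3 seconds

When the hold expires, or a laptop wakes up after an hour, the client just asks the bus for everything after the last message id it saw, which is exactly what the reliable backlog described above provides.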

We have yet to see any installations hit bottlenecks here (raw connection bottlenecks, that is; the fallbacks are super robust).