written on Monday, September 24, 2012
Out of curiosity I taught the Fireteam presence server websockets as a protocol in addition to the proprietary protocol it speaks out of the box. I don't really have a use case for websockets per se with the server, but I thought it might make it more useful for customers that want to use the event based parts of the API with HTML5 games. This post focuses on implementing websockets on the server, not on how you would use them from the client, and basically collects all the things I wish I had known before.
So let's start with that part first. What are websockets and why would you use websockets? Basically only if you want to have a bidirectional communication between an actual web browser and some server. Websocket is not necessarily a good protocol if neither of the endpoints is an actual browser. Websockets suffer a lot under the restrictions and implementation details that were forced upon the protocol to make it work with existing HTTP infrastructure.
Websockets in the current iteration as specified by RFC 6455 do a bunch of things differently to what a raw TCP connection does. The name websocket gives the impression that it's a traditional socket. In practice it combines parts of UDP and TCP: it's message based like UDP, but it's reliable like TCP. So assuming you know what TCP is, here is what websocket adds on top of that:
Websockets make you sad. There, I said it. What started out as a really small simple thing ended up as an abomination of (what feels like) needless complexity. Now the complexity comes for a reason. The protocol went through many iterations and basically had to be changed multiple times because of unforeseen security problems that came up with misbehaving proxies. The protocol I created for the internal communication of our server is upgrading from HTTP just like websockets do, but without the “secure” parts. And here is why it does not matter:
Everybody knows about HTTP proxies. We have proxies that do load balancing on the application side, we have proxies that do SSL offloading, we have proxies for all kinds of things. Unfortunately outside our internal infrastructure every one of us also has to deal with HTTP proxies in corporate networks, and worse, on mobile network connections. The amount of shitty HTTP middleware installed all around the world is just staggering. And this pretty much has shown me that the only way you can do anything useful these days is by putting TLS on top of everything and just forcing people to stop with their shenanigans. On O2's mobile networks you cannot use websockets unless they are encrypted. You cannot get websocket connections through Amazon's ELB (their HTTP/TCP load balancer) in HTTP mode. Heck, you can't even get PATCH as an HTTP method through the ELB.
Everything that Fireteam will be doing will most likely always be behind an encrypted connection. It guarantees me that nothing can do funny things with the data I'm sending around. And as a positive side effect I don't have to mask my communication like websocket does because I know the software stack on my side until it hits the internet. The communication gets encrypted and I know nobody is going to change my data on the way to the client.
In fact, I would also recommend to always use websockets through TLS. Even if you don't care about the security side of things you will still benefit from the fact that your websocket connections succeed in many more cases. Not encrypting your connection is definitely something you will regret sooner or later.
Alright. After this fairly long disclaimer you're still there, which probably means you still want to do websockets. Fair enough. Now let's start with the basics, the handshake. This is where everything starts. It upgrades your connection from HTTP to something else. For the internal protocol we recommend to customers, we upgrade our HTTP connection basically to a raw TCP connection. Websockets are not an upgrade to TCP; they are an upgrade to message based communication.
To begin: why would you upgrade from HTTP instead of directly starting with TCP as a protocol? The reasons why Fireteam's protocol starts with HTTP are not all that different from why websockets upgrade from HTTP.
Websockets upgrade from HTTP because it was believed that people would develop servers that serve both websocket connections as well as HTTP ones. I don't believe for a second that this will be what people do at the end of the day however. A server that handles stateless requests and answers with a reply has a very different behavior than a server that keeps TCP connections open. However the advantage is that websockets use the same ports as HTTP and HTTPS do and that is a huge win. It's a win because these are privileged ports (< 1024) and they are traditionally handled differently than non privileged ports. For instance on a Linux system only root can open such ports. Even more important: ELB only lets you open a handful of these privileged ports (25, 80 and 443 to be exact). Since ELB also does socket level load balancing you can still do websockets on Amazon, just not through their HTTP load balancer.
We're handling our persistent presence protocol very differently than our HTTP webservice but we still benefit from the HTTP upgrade in some edge cases. That's mainly where we have to tunnel our communication through an HTTP library because arbitrary socket connections are not possible for security or scalability reasons. If you have ever used Google Appengine or early Windows Phone you will have noticed that HTTP connections are possible where regular socket connections are not.
You will also see that many corporate networks only allow certain ports outgoing. The fact that websockets use the same ports as HTTP/HTTPS makes this much more interesting. Anyone who has ever used the Flash socket policy system will know that pain. Currently it's entirely impossible to use Flash sockets behind Amazon's ELB because the Flash VM will attempt to connect to port 843 to get authorization information. That's a port you can't open on the ELB. So the idea of starting with HTTP is pretty solid.
HTTP always supported upgrades, but unfortunately many proxies seem to have ignored that part of the specification. The main reason for that probably is that until websockets came around nobody was actually using the Upgrade flag. There was an SSL upgrade RFC that used the same mechanism but I don't think anyone is using that.
Alright. So what does the handshake look like? The upgrade is initiated by the client, not by the server. Fireteam upgrades the connection by following the old SSL RFC, and the request looks like this:
    OPTIONS / HTTP/1.1
    Host: example.com
    Upgrade: firepresence
    X-Auth-Token: auth-info-here
The server then replies by upgrading:
    HTTP/1.1 101 Switching Protocols
    Upgrade: firepresence/1.0
    Connection: Upgrade
If the upgrade header was missing, the server instead answers with 426 Upgrade Required:
HTTP/1.1 426 Upgrade Required
What's interesting about this is that the Upgrade Required status code is defined, but it does not show up in the HTTP/1.1 RFC. Instead it comes from that SSL RFC.
Websockets upgrade very similarly, but they use 400 Bad Request to signal a missing upgrade. They also transmit a special key with the upgrade request which the server has to process and send back. This is done so that a websocket connection cannot be established with an endpoint that is not aware of websockets. Here is what the handshake looks like for the client:
    GET / HTTP/1.1
    Host: example.com
    Upgrade: websocket
    Connection: Upgrade
    Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
    Origin: http://example.com
The websocket key is the base64 encoding of random bytes chosen by the client. The server takes the header value exactly as sent (it does not decode it), appends the special string 258EAFA5-E914-47DA-95CA-C5AB0DC85B11 to it, then creates the SHA1 hash of the result and base64 encodes that hash (the bytes, not the hexadecimal representation). The magic string looks like a UUID and also is one, but that's completely irrelevant because the exact string needs to be used. A lowercase representation or braces around the string would obviously fail. The resulting value is then put into the Sec-WebSocket-Accept header. When the server has computed the value it can send an upgrade response back:
    HTTP/1.1 101 Switching Protocols
    Upgrade: websocket
    Connection: Upgrade
    Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
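The server side of the handshake can be sketched in a few lines of Python. This is only a minimal illustration, not how the Fireteam server does it: the function names are made up, and `headers` is assumed to be a dict of already parsed request headers with lower-cased names.

```python
import base64
import hashlib

# Fixed GUID from RFC 6455; it must be used exactly as written.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def compute_accept(sec_websocket_key: str) -> str:
    """Return the Sec-WebSocket-Accept value for a client key."""
    # The key is used as sent: the base64 text is not decoded first.
    material = (sec_websocket_key.strip() + WEBSOCKET_GUID).encode("ascii")
    digest = hashlib.sha1(material).digest()
    # Encode the 20 raw digest bytes, not the hexadecimal representation.
    return base64.b64encode(digest).decode("ascii")


def build_handshake_response(headers: dict) -> bytes:
    """Build a 101 response, or 400 if this is not a websocket upgrade.

    `headers` is assumed to map lower-cased header names to values.
    """
    key = headers.get("sec-websocket-key")
    if headers.get("upgrade", "").lower() != "websocket" or not key:
        return b"HTTP/1.1 400 Bad Request\r\n\r\n"
    lines = [
        "HTTP/1.1 101 Switching Protocols",
        "Upgrade: websocket",
        "Connection: Upgrade",
        "Sec-WebSocket-Accept: " + compute_accept(key),
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```

Feeding it the sample key from the request above yields the accept value shown in the response.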
The handshake can also include a protocol request and the websocket version information but you can't include arbitrary other headers. If you compare the websocket upgrade with our own upgrade you will notice that we can't transmit the authorization information. There are two ways around that. You can either transmit the authorization information as the first message after the handshake or put it into the URL as a query parameter.
Also notice that the Sec-WebSocket-Accept header brings its own grammar for the value. Normally you would expect that you can quote the value but the specification specifically requires a base64 value there:
    Sec-WebSocket-Accept     = base64-value-non-empty
    base64-value-non-empty = (1*base64-data [ base64-padding ]) |
                             base64-padding
    base64-data      = 4base64-character
    base64-padding   = (2base64-character "==") |
                       (3base64-character "=")
    base64-character = ALPHA | DIGIT | "+" | "/"
Alright. As if websockets were not painful enough as they are, someone had the amazing idea to also introduce a new URL scheme. Two in fact: ws:// and wss://. Sounds like a tiny change from http to https but unfortunately that's not the case. URLs have scheme specific grammar. For instance FTP URLs can have authorization information in the netloc part (ftp://username@server/) whereas HTTP can't. mailto URLs don't have the leading slashes etc. Websocket URLs are special in that they do not support anchors (#foo). Now why would that matter? It matters because whoever created the URL parsing module in Python also decided to stick so closely to the strict standard that you cannot throw arbitrary URLs at the module. For instance if you try to parse websocket URLs you quickly realize that the results are just wrong:
    >>> import urlparse
    >>> urlparse.urlparse('wss://foo/?bar=baz')
    ParseResult(scheme='wss', netloc='foo', path='/?bar=baz', params='', query='', fragment='')
The reason why websockets have a separate URL is beyond me. I suspect it stems from the fact that the RFC hints towards eventually dropping the upgrade from HTTP so the HTTP URL would not make much sense. In any case it's just a very annoying example of where we now have to do things that were previously unnecessary.
Also since it's a different protocol, protocol relative links will obviously not work. You will have to switch between wss and ws by hand.
Otherwise the same rules as for HTTP style URLs apply. Namely that ws is unencrypted and has port 80 as default port and wss requires TLS encryption and port 443 as default.
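For comparison, here is a small sketch of how such a URL could be taken apart with Python 3's `urllib.parse`, which splits the query for any scheme (the session above used the Python 2 `urlparse` module). The helper name `parse_websocket_url` and its return shape are made up for this example; it applies the default ports and rejects fragments as described.

```python
from urllib.parse import urlsplit

# Default ports mirror HTTP and HTTPS.
DEFAULT_PORTS = {"ws": 80, "wss": 443}


def parse_websocket_url(url: str):
    """Split a ws:// or wss:// URL into (secure, host, port, resource)."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError("not a websocket URL: %r" % url)
    # Websocket URLs do not support anchors.
    if "#" in url:
        raise ValueError("websocket URLs must not contain a fragment")
    if not parts.hostname:
        raise ValueError("missing host")
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    # The resource name is the path plus the query string, if any.
    resource = parts.path or "/"
    if parts.query:
        resource += "?" + parts.query
    return parts.scheme == "wss", parts.hostname, port, resource
```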
Now we know how to connect to the websocket server, how to upgrade to the websocket protocol and how authorization can be handled without losing the IP address information even if we do TCP level load balancing. The next thing you have to know is how websocket transfer works. As mentioned earlier websocket is not a stream based protocol like TCP, it's message based. What's the difference? With TCP you send bytes around and have to make sure (for the most part) that you can figure out the end of a message. Our own protocol makes this very easy because we send full JSON objects around which are self terminating. For naive JSON parsers (like the one in the Python standard library) that cannot parse off a stream we also add a newline at the end and ensure that all newlines in JSON strings are escaped. So you can just read to the newline and then hand that line to the JSON parser.
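The newline trick for a stream based protocol can be illustrated with a short Python sketch. It is not the actual Fireteam implementation; `LineJSONDecoder` and `encode_message` are names invented for this example.

```python
import json


class LineJSONDecoder:
    """Collect bytes from a stream and return one JSON object per line."""

    def __init__(self):
        self._buffer = b""

    def feed(self, data: bytes):
        """Add received bytes and return the messages completed so far."""
        self._buffer += data
        # Everything before the last newline is a complete message; the
        # remainder stays in the buffer until more data arrives.
        *lines, self._buffer = self._buffer.split(b"\n")
        return [json.loads(line) for line in lines if line]


def encode_message(obj) -> bytes:
    # json.dumps escapes newlines inside strings, so the only raw newline
    # on the wire is the terminator added here.
    return json.dumps(obj).encode("utf-8") + b"\n"
```

The decoder can be fed whatever chunks the socket happens to deliver; messages only come out once their terminating newline has arrived.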
So let's have a look first how the frames are defined. This is what the RFC provides us with:
    +-+-+-+-+-------+-+-------------+-------------------------------+
     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-------+-+-------------+-------------------------------+
    |F|R|R|R| opcode|M| Payload len |    Extended payload length    |
    |I|S|S|S|  (4)  |A|     (7)     |             (16/64)           |
    |N|V|V|V|       |S|             |   (if payload len==126/127)   |
    | |1|2|3|       |K|             |                               |
    +-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - - +
    |     Extended payload length continued, if payload len == 127  |
    + - - - - - - - - - - - - - - - +-------------------------------+
    |                               |Masking-key, if MASK set to 1  |
    +-------------------------------+-------------------------------+
    | Masking-key (continued)       |          Payload Data         |
    +-------------------------------- - - - - - - - - - - - - - - - +
    :                     Payload Data continued ...                :
    + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
    |                     Payload Data continued ...                |
    +---------------------------------------------------------------+
Good news first: as of the websocket version specified by the RFC it's only a header in front of each packet. The bad news is that it's a rather complex header and it has the frightening word “mask” in it. Here are the individual parts explained:
fin (1 bit): indicates if this frame is the final frame that makes up the message. Most of the time the message fits into a single frame and this bit will always be set. Experiments show that Firefox makes a second frame after 32K however.
rsv1, rsv2, rsv3 (1 bit each): it wouldn't be a proper protocol if it did not include reserved bits. As of right now, they are unused.
opcode (4 bits): the opcode. Mainly says what the frame represents. The following values are currently in use:
(As you can see, there are enough values unused, they are reserved for future usage).
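For reference, the opcodes that RFC 6455 defines can be written down as a small Python enum. The class and helper names are only for illustration; the numeric values are the ones from the RFC, and everything not listed is reserved.

```python
from enum import IntEnum


class Opcode(IntEnum):
    CONTINUATION = 0x0  # further fragment of a message
    TEXT = 0x1          # UTF-8 text message
    BINARY = 0x2        # binary message
    CLOSE = 0x8         # connection close
    PING = 0x9
    PONG = 0xA


def is_control(opcode: int) -> bool:
    # Control opcodes have the high bit of the 4-bit field set (0x8-0xF).
    return bool(opcode & 0x8)
```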
mask (1 bit): indicates if the payload of this frame is masked. As it stands right now, every frame from client to server must be masked and the spec wants you to terminate the connection if it's unmasked.
payload_len (7 bits): the length of the payload. 7 bits is not enough? Of course not. Websocket frames come in the following length brackets:
Values of 0-125 mean the payload is that long. 126 means that the following two bytes indicate the length, 127 means the next 8 bytes indicate the length. So it comes in ~7bit, 16bit and 64bit. I don't even have words for this. My browser fragments off after 32K of payload anyways, when would I ever send a frame of 64bit size (oh well, the most significant bit must be zero at least)? 32bit would have been plenty but oh well. This also means there is more than one way to represent the length but the spec is very clear about only using the shortest possible way to define the length of a frame.
masking-key (32 bits): if the mask bit is set (and trust me, it is if you write for the server side) you can read four unsigned bytes here which are used to xor the payload with. It's used to ensure that shitty proxies cannot be abused by attackers from the client side.
payload: the actual data and most likely masked. Its length is the value given by payload_len (or by the extended length if payload_len is 126 or 127).
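Put together, reading the header fields described above could look roughly like the following Python sketch. The `FrameHeader` tuple and `parse_frame_header` are hypothetical names; the function ignores the reserved bits and returns `None` while the buffer does not yet hold the whole header, which is one possible way to deal with partial reads.

```python
import struct
from typing import NamedTuple, Optional


class FrameHeader(NamedTuple):
    fin: bool
    opcode: int
    masked: bool
    payload_len: int
    mask_key: Optional[bytes]
    header_len: int  # number of bytes the header occupies


def parse_frame_header(data: bytes) -> Optional[FrameHeader]:
    """Parse a frame header from the start of `data`.

    Returns None if `data` does not contain the complete header yet.
    """
    if len(data) < 2:
        return None
    first, second = data[0], data[1]
    fin = bool(first & 0x80)
    opcode = first & 0x0F
    masked = bool(second & 0x80)
    length = second & 0x7F
    offset = 2
    if length == 126:
        # 16 bit extended length, network byte order.
        if len(data) < offset + 2:
            return None
        (length,) = struct.unpack_from("!H", data, offset)
        offset += 2
    elif length == 127:
        # 64 bit extended length, network byte order.
        if len(data) < offset + 8:
            return None
        (length,) = struct.unpack_from("!Q", data, offset)
        offset += 8
    mask_key = None
    if masked:
        if len(data) < offset + 4:
            return None
        mask_key = data[offset:offset + 4]
        offset += 4
    return FrameHeader(fin, opcode, masked, length, mask_key, offset)
```

The payload then starts at `header_len` and is `payload_len` bytes long.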
Ah, the best part. Payload data can be split up into multiple individual frames. The receiving end is supposed to buffer them up until the fin bit is set. So you can transmit the string Hello World in 11 frames of 6 (header length) + 1 byte each if that's what floats your boat. Fragmentation is not allowed for control frames, but the specification wants you to be able to handle control frames interleaved with the fragments of a message. You know, in case TCP packets arrive in arbitrary order :-/.
The logic for joining frames is roughly this: receive the first frame, remember its opcode, and concatenate the frame payloads together until the fin bit is set. Assert that the opcode of each frame after the first is zero (continuation).
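The joining logic can be sketched like this in Python. `MessageAssembler` is a made-up name and the payloads are assumed to be already unmasked; control frames are handed back immediately so that a ping arriving in the middle of a fragmented message does not disturb the buffered fragments.

```python
class MessageAssembler:
    """Join data frames into messages; control frames pass straight through."""

    def __init__(self):
        self._opcode = None  # opcode of the message being assembled
        self._parts = []

    def feed(self, fin: bool, opcode: int, payload: bytes):
        """Return (opcode, payload) for a complete message, else None."""
        if opcode & 0x8:
            # Control frame: never fragmented, may arrive mid-message.
            if not fin:
                raise ValueError("fragmented control frame")
            return opcode, payload
        if self._opcode is None:
            # First frame of a message carries the real opcode.
            if opcode == 0x0:
                raise ValueError("continuation without a first frame")
            self._opcode = opcode
        elif opcode != 0x0:
            # Every later frame must be a continuation.
            raise ValueError("expected a continuation frame")
        self._parts.append(payload)
        if not fin:
            return None
        message = self._opcode, b"".join(self._parts)
        self._opcode, self._parts = None, []
        return message
```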
But when dealing with the payload we not only have to concatenate frames together, we also have to unmask them. The unmasking is pretty simple once you have the mask key:
    /* mask holds the four masking-key bytes read from the frame header */
    uint8_t payload[payload_len];
    size_t i;
    read_bytes(payload, payload_len);
    for (i = 0; i < payload_len; i++)
        payload[i] ^= mask[i % 4];
Masking is the best part because it makes debugging so incredibly fun.
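The same operation as a minimal Python sketch (the function name `apply_mask` is just for illustration). Because xor is its own inverse, the one function both masks and unmasks.

```python
def apply_mask(payload: bytes, mask_key: bytes) -> bytes:
    """XOR every payload byte with the masking key, cycling over its 4 bytes."""
    return bytes(b ^ mask_key[i % 4] for i, b in enumerate(payload))
```

With the masking key 37 fa 21 3d, the masked bytes 7f 9f 4d 51 58 from the sample frame in RFC 6455 come out as Hello.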
Why is there masking at all? Because apparently there is enough broken infrastructure out there that lets the upgrade header go through and then handles the rest of the connection as a second HTTP request which it then stuffs into the cache. I have no words for this. In any case, the defense against that is basically a strong 32bit random number as masking key. Or you know… use TLS and don't use shitty proxies.
In the case of our proprietary protocol that's not even a problem because we only allow JSON requests. As such if you would attempt to submit an HTTP request in place of a JSON payload the server would respond with a generic error message which is not very useful for attacking purposes. But really… use TLS and don't use shitty proxies.
Heartbeating is useful, I can agree with that. First of all certain things (like ELB \o/) will terminate idle connections, secondly it is not possible for the receiving side to see if the remote side terminated. Only at the next send would you realize that something went wrong. With websockets you can send the ping opcode at any time to ask the other side to pong. Pings can be sent whenever an endpoint thinks it should and a pong is sent “as soon as is practical”. Someone also decided that something like ping and pong are too simple so they were “improved” so that they can carry application data and if you pong you have to send the payload of the ping back. Sounds easy enough to implement but this actually can make it fairly annoying to deal with because now there is something you have to remember from a ping to a pong (and that application data can be up to 125 bytes).
Lastly: closing connections. Now to go with the rest of the pattern websocket rolls its own thing here. In theory a TCP disconnect should work as well but it looks like at least Firefox just reconnects on connection drop. Instead a connection is terminated by sending the close opcode (0x08). There the pattern seems to be to exchange close opcodes first and then let the server shut down. The client is supposed to give the server some time to close the connection before attempting to do that on its own. The close can also signal why it terminated the connection. Being the lazy person I am, I did not care and just close. Important however is that you do send the opcode around, otherwise at least Firefox will not really believe that you closed the connection.
It should be said that the specification does not introduce a close opcode just to make the protocol more complex. It does have actual use in that it makes the disconnect more reliable. Anyone that ever had to deal with TCP disconnects will know that this can be a somewhat tricky thing to do and behaves differently on different environments. That being said, I don't believe that websocket implementations will get disconnects right either, which now leaves developers on both sides hoping that the implementation is correct. It's too far from TCP for you to fix the problem yourself if you are the programmer that writes the client-side implementation.
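Answering a ping from the server side could be sketched as follows in Python. `encode_frame` and `answer_ping` are hypothetical helpers; the encoder writes unmasked frames (server to client) and picks the shortest length representation as the spec demands.

```python
import struct

OP_PING = 0x9
OP_PONG = 0xA


def encode_frame(opcode: int, payload: bytes = b"", fin: bool = True) -> bytes:
    """Encode an unmasked (server-to-client) frame."""
    first = (0x80 if fin else 0x00) | opcode
    length = len(payload)
    if length <= 125:
        header = struct.pack("!BB", first, length)
    elif length <= 0xFFFF:
        header = struct.pack("!BBH", first, 126, length)
    else:
        header = struct.pack("!BBQ", first, 127, length)
    return header + payload


def answer_ping(ping_payload: bytes) -> bytes:
    """Build the pong frame for a received ping, echoing its payload."""
    if len(ping_payload) > 125:
        raise ValueError("control frame payload too long")
    return encode_frame(OP_PONG, ping_payload)
```

Since the pong is built directly from the ping that was just received, nothing has to be remembered between the two in this simple variant.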
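For completeness, a close frame that does carry a reason could be built roughly like this. The helper names are invented for the sketch; the layout (a two byte status code in network byte order followed by an optional UTF-8 reason) and the status code 1000 for a normal closure come from RFC 6455.

```python
import struct

OP_CLOSE = 0x8
CLOSE_NORMAL = 1000  # "normal closure" status code from RFC 6455


def build_close_frame(code: int = CLOSE_NORMAL, reason: str = "") -> bytes:
    """Build an unmasked close frame carrying a status code and a reason."""
    payload = struct.pack("!H", code) + reason.encode("utf-8")
    if len(payload) > 125:
        raise ValueError("close payload too long for a control frame")
    # fin bit set, opcode 0x8, then the 7 bit payload length.
    return bytes([0x80 | OP_CLOSE, len(payload)]) + payload


def parse_close_payload(payload: bytes):
    """Return (code, reason); the code is None if the peer sent no body."""
    if len(payload) < 2:
        return None, ""
    (code,) = struct.unpack_from("!H", payload)
    return code, payload[2:].decode("utf-8")
```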
There are also differences between how browsers respond to websockets. For instance if you do not provide an application level protocol with Chrome but the server emits the Sec-WebSocket-Protocol header Chrome will loudly complain about a protocol mismatch whereas Firefox does not care a single bit. Safari 5.1 (the one I have installed) does not speak the current protocol of websockets and sends different headers altogether.
What's also interesting is that Firefox treats the upgrade as an actual HTTP request and sends the regular headers (User agent, DNT flag, accept headers, cache control, etc.). On the other hand Chrome will just submit the bare minimum and special cases the cookie header. As such you won't be able to do browser based detection just from the handshake.
Websockets are complex, way more complex than I anticipated. I can understand that they work that way but I definitely don't see a value in using websockets instead of regular TCP connections if all you want is to exchange data between different endpoints and neither is a browser. If you do want to make them work I recommend the following things:
Looking at the RFC it's pretty clear that websockets won't be getting any simpler any time soon. It hints towards adding multiplexing support and dropping the remaining HTTP upgrade parts. Also there is a builtin extension system that will soon be used to negotiate per-frame compression and that's definitely not the end of it. I also did not mention that you can negotiate version numbers and application level protocols through the websocket handshake or that it defines termination codes when the connection closes. There is an awful lot of stuff in the specification and even more to come which gives me the impression that we will see some broken implementations of websockets in the future.
Maybe a simple policy file like Unity3D and Flash are using would have been a better idea and just let people speak TCP themselves. At least it leaves you as application developer with more options to fix problems yourself instead of hoping the websocket implementation is 100% correct. But well, that's what we're stuck with now and since it already took three or four major revisions of the specification and god knows how many browser updates it's probably not a very wise decision to revisit the protocol now. I do however believe that when browsers finally get CORS running for SSE this might be a better solution for many use-cases where people might want to use websockets. And that is definitely easier to implement.