written on Tuesday, May 25, 2010
Yesterday, after my talk about WSGI on Python 3, I announced an OpenSpace about WSGI. However, only two people showed up, which was quite disappointing. On the bright side, however, it ran in parallel to some interesting lightning talks, and I did not explain too well what the purpose of this OpenSpace was.
In order to do better this time around, I want to summarize the current situation of WSGI on Python 3, what the options are and why I'm at the moment thinking of going back to an earlier proposal that was dismissed already.
So here we go again:
There are a couple of changes in the Python language that are relevant to WSGI because they make certain things harder to implement and others easier. In Python 2.x bytestrings and unicode strings shared many methods and Python would do a lot to make it easy for you to implicitly switch between the two types. The unicode decode and unicode encode errors everybody knows from Python are often caused by exactly this implicit conversion.
Now in Python 3 the whole thing looks a lot different. There are only unicode strings now and the bytestrings got replaced by things that are more like arrays than strings. Take this Python 2 example:
>>> 'foo' + u'bar'
u'foobar'
>>> 'foo %s' % 42
'foo 42'
>>> print 'foo'
foo
>>> list('foo')
['f', 'o', 'o']
And the same operations on Python 3:
>>> b'foo' + 'bar'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
>>> b'foo %s' % 42
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for %: 'bytes' and 'int'
>>> print(b'foo')
b'foo'
>>> list(b'foo')
[102, 111, 111]
There are ways to convert these bytes to unicode strings and the other way round. There are also string methods like title() and upper() and everything you know from a string, but it still does not behave like a string. Keep this in mind when reading the rest of this article, because it explains why the straightforward approach does not work out too well at the moment.
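To make the conversion point concrete, here is a small Python 3 sketch. The sample value is made up for illustration; the calls themselves are plain bytes and str methods.

```python
# Python 3: bytes and str are distinct types and conversion is always explicit.
raw = b'caf\xc3\xa9'              # UTF-8 encoded bytes (made-up sample value)
text = raw.decode('utf-8')        # bytes -> str
encoded = text.encode('utf-8')    # str -> bytes, round-trips to the original

# Familiar string methods exist on bytes ...
shouted = b'foo'.upper()          # b'FOO'
titled = b'foo bar'.title()       # b'Foo Bar'

# ... but indexing yields integers, while slicing yields bytes.
first_item = raw[0]               # 99, the byte value of "c"
first_slice = raw[:1]             # b'c'
```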
WSGI, like HTTP and URIs, is based on ASCII, on an encoding like latin1, or even on a mix of different encodings. None of them is based on a single encoding that represents unicode. In Python 2 the unicode situation for web applications was fixed pretty quickly by all frameworks in the same way: you as the framework or application know the encoding, so you decode incoming request data from the given charset and operate on unicode internally. If you go to the database, back to HTTP or to anything else that does not operate on unicode, you encode to the target encoding, which you know.
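The decode, operate, encode pattern can be sketched roughly like this. It is written for Python 3, and the function name handle_name and its charset parameter are hypothetical, not taken from any of the frameworks mentioned here.

```python
def handle_name(raw_value, charset='utf-8'):
    """Hypothetical handler: decode on the way in, encode on the way out."""
    # The application knows its own charset, so it can decode request data.
    name = raw_value.decode(charset, 'replace')
    # All actual processing happens on unicode strings.
    greeting = 'Hello, %s!' % name.strip().title()
    # Encode again only when leaving the application (HTTP, database, ...).
    return greeting.encode(charset)
```

Called with b'  jos\xc3\xa9 ', this returns b'Hello, Jos\xc3\xa9!'. The important part is that the charset is chosen by the application, not by the protocol.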
This is painless, and some libraries like Django make it even less painful by having special helpers that can convert between utf-8 encoded strings and actual unicode objects at any point. Here is a list of web related libraries operating on unicode (just a small pick): Django, Pylons, TurboGears 2, WebOb, Werkzeug, Jinja, SQLAlchemy, Genshi, simplejson, feedparser, and the list goes on.
What these libraries have, and what a protocol like WSGI does not, is knowledge of the encoding in use. Why? Because in practice (not on paper) encodings on the web are very simple and driven by the application: the encoding the application sends out is the encoding that comes back. It's as simple as that. WSGI, however, does not have that knowledge, because how would you tell WSGI what encoding to assume? There is no configuration for WSGI, so the only thing we could do to get unicode onto that layer is to force a specific charset for WSGI applications on Python 3: for example utf-8 for everything except headers, which should be latin1 for RFC compliance.
On Python 2 WSGI is based on bytes. If we went with bytes on Python 3 as well, the specification for Python 3 would look like this:
If we ignore everything else that makes this approach hard on Python 3 and only look at the bytes object, which just does not behave like a standard string any more, a WSGI library based on the standard library's functions and the bytes type is quite complex compared to its Python 2 counterpart. Take the very simple code commonly used to reproduce a URL from the WSGI environment on Python 2:
    import urllib  # Python 2: quote() lives directly in urllib

    def get_host(environ):
        # Prefer the Host header, fall back to server name and port.
        if 'HTTP_HOST' in environ:
            return environ['HTTP_HOST']
        result = environ['SERVER_NAME']
        if (environ['wsgi.url_scheme'], environ['SERVER_PORT']) not \
           in (('https', '443'), ('http', '80')):
            result += ':' + environ['SERVER_PORT']
        return result

    def get_current_url(environ):
        rv = '%s://%s/%s%s' % (
            environ['wsgi.url_scheme'],
            get_host(environ),
            urllib.quote(environ.get('SCRIPT_NAME', '').strip('/')),
            urllib.quote('/' + environ.get('PATH_INFO', '').lstrip('/'))
        )
        qs = environ.get('QUERY_STRING')
        if qs:
            rv += '?' + qs
        return rv
This depends on many string operations and is entirely based on bytes (like URLs are). So what has to be changed to make this code work on Python 3? Here is an untested version of the same code, adapted to theoretically run on a byte based WSGI implementation for Python 3.
The get_host() function is easy to port because it only concatenates bytes. This works exactly the same on Python 3. In theory we could even improve on it by switching to bytearrays, which are mutable bytes objects and could give us better memory management. But here is the straightforward port:
    def get_host(environ):
        if 'HTTP_HOST' in environ:
            return environ['HTTP_HOST']
        result = environ['SERVER_NAME']
        if (environ['wsgi.url_scheme'], environ['SERVER_PORT']) not \
           in ((b'https', b'443'), (b'http', b'80')):
            result += b':' + environ['SERVER_PORT']
        return result
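For comparison, the bytearray idea could look something like the following. This is only a sketch: build_host is a hypothetical helper, and whether it actually saves memory in practice is not something measured here.

```python
def build_host(server_name, server_port):
    """Hypothetical helper: join name and port using a mutable buffer."""
    buf = bytearray(server_name)   # copy the bytes into a mutable bytearray
    buf += b':'                    # appends to the existing buffer
    buf += server_port
    return bytes(buf)              # freeze back into an immutable bytes object
```

Calling build_host(b'example.com', b'8080') gives b'example.com:8080'.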
The port of the actual get_current_url() function is a little different because the string formatting features used in the Python 2 implementation are no longer available:
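    import urllib.parse  # Python 3: quote() moved to urllib.parse

    def get_current_url(environ):
        # quote() accepts bytes but returns str, so the result is encoded
        # back to ASCII before it is concatenated with the other bytes.
        rv = (
            environ['wsgi.url_scheme'] + b'://' +
            get_host(environ) + b'/' +
            urllib.parse.quote(environ.get('SCRIPT_NAME', b'').strip(b'/')).encode('ascii') +
            urllib.parse.quote(b'/' + environ.get('PATH_INFO', b'').lstrip(b'/')).encode('ascii')
        )
        qs = environ.get('QUERY_STRING')
        if qs:
            rv += b'?' + qs
        return rv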
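One detail worth checking when trying a port like this: on Python 3 the quoting function lives in urllib.parse, and it returns a unicode string even when it is handed bytes. A small sketch with a made-up path:

```python
from urllib.parse import quote

path = b'/caf\xc3\xa9 menu'             # made-up UTF-8 encoded path
quoted = quote(path)                    # bytes in, str out
quoted_bytes = quoted.encode('ascii')   # safe: quoted output is pure ASCII

# Only after re-encoding can the result be joined with other bytes.
url = b'http://example.com' + quoted_bytes
```

Here quoted is the str '/caf%C3%A9%20menu', so a byte based WSGI library has to encode it again before it can continue working on bytes.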
The example did not necessarily become harder, but it became a little bit more low level. When the developers of the standard library ported over some of the functions and classes related to web development, they decided to introduce unicode in places where it does not really belong. It's an understandable decision based on how byte strings work on Python 3, but it does cause some problems. Here is a list of places where we now have unicode and previously did not. I am not judging whether the decision to introduce unicode there was right or wrong, just noting that it happened:
There are some other places as well where unicode appeared, but these are the ones causing the most trouble, besides the whole "bytes are not strings" thing. What most of WEB-SIG later agreed on, and what Graham ultimately implemented for mod_wsgi, is a fake unicode approach. What does this mean? Make sure that all the information is stored as unicode, not with the proper encoding (which WSGI would not know) but by just assuming latin1. If latin1 is not what the application expected, the application can encode back to latin1 and decode from utf-8. (As far as I know, this is lossless.)
Here is what the current specification, which is about to be crafted into a PEP, looks like:
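The encode back to latin1, decode from utf-8 step can be sketched as follows (Python 3). The helper name wsgi_to_unicode is hypothetical and not part of any specification; the second half checks the round trip for every possible byte value.

```python
def wsgi_to_unicode(value, charset='utf-8'):
    """Hypothetical helper: undo the latin1 assumption made by the server."""
    return value.encode('latin1').decode(charset, 'replace')

# What the client sent on the wire: "café" encoded as UTF-8.
wire = 'caf\xe9'.encode('utf-8')

# What a fake unicode server would hand to the application.
fake = wire.decode('latin1')

# What the application actually wants.
real = wsgi_to_unicode(fake)

# latin1 maps each of the 256 byte values to exactly one code point,
# so decoding and re-encoding returns the original bytes unchanged.
all_bytes = bytes(range(256))
round_tripped = all_bytes.decode('latin1').encode('latin1')
```

In this sketch fake is the mangled looking 'cafÃ©' while real is 'café', and round_tripped equals all_bytes.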
I did some tests lately, toying around and starting to work on a port of Werkzeug, but the more I worked with it, the more I disliked it. WSGI in Python 2 was already a protocol that was far more complex than it should have been, and some parts of it just don't make any sense (like the input stream having readline without size), but it was something you could get started with quickly and the basics were simple. Middlewares, the area where WSGI was already far too complex, now just become more complex because they have to encode unicode strings before they can operate on them, even if it's just for comparing.
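To illustrate the middleware problem, here is a rough sketch of a tiny middleware under the fake unicode approach. The name require_prefix is hypothetical, and the sketch assumes the server stores PATH_INFO as a latin1 decoded string while the application uses utf-8.

```python
def require_prefix(app, prefix, charset='utf-8'):
    """Hypothetical middleware: 404 unless PATH_INFO starts with prefix."""
    def middleware(environ, start_response):
        raw_path = environ.get('PATH_INFO', '')
        # Even a plain comparison needs the encode/decode dance first,
        # otherwise a non-ASCII prefix would never match.
        path = raw_path.encode('latin1').decode(charset, 'replace')
        if not path.startswith(prefix):
            start_response('404 Not Found', [('Content-Type', 'text/plain')])
            return [b'Not Found']
        return app(environ, start_response)
    return middleware
```

A middleware that only ever compares against ASCII prefixes could skip the transcoding, but a generic one cannot know that in advance.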
It just feels like the more I play with it, the more unhappy I become with how the bytes object works and how the standard library behaves. And I doubt I will be the only one here. Playing with the actual code shows problems you wouldn't spot on paper, so I would love to see a wider crowd of people toying with both the language and the specification to make sure WSGI stays a specification everybody is happy with.
Right now I'm a little bit afraid we will end up with a specification that requires us to do the encode/decode/encode/decode dance just because the standard library and a limitation of the bytes object make us do it. Because one thing is for certain: ASCII and bytes are here to stay. Nobody can change the protocols that are in use, and even then they would, at the very bottom, have to be based on bytes. And if the tools to work with them are not good enough in Python 3, we will see the problems with that on multiple levels, not just WSGI (databases, email, and more).
What I currently have in mind is a bit more than what was ever under discussion for WSGI, which is why I don't expect anything like that to be implemented, but it can't hurt to share:
I am happy to accept quasi-unicode support as well and will port Werkzeug over to it. But there is probably still time to improve the specification and the language so that everybody is happy. Right now it looks like not a lot of people are playing with the specification, the language and the implications of all that. The reason why Python 3 is not as good as it could be is that far too few people look at it. It is clear that the future of Python will be Python 3 and that there are no intentions of making any other releases than Python 2.7, so to make the process less painful it's necessary to start playing with it now.
So I encourage everyone to play with Python 3, the spec and the standard library so that there is more input. Maybe the bytes issue is what I think it is, maybe it's not. But if only four people are discussing the issue, there is too little input to make rational decisions.