written on January 22, 2011
For web applications the safest bet currently is to stick with Python 2.x, even for new projects, for the simple reason that right now we don't have enough supporting libraries for Python 3 and porting some of them over is a huge step. But with all the people telling you that it's hard and painful to upgrade to Python 3, how does one make this upgrade easier?
For high-level applications an upgrade is actually quite simple if the application can trust its supporting libraries to behave consistently after the port to Python 3. In fact there is no reason why an upgrade to Python 3 shouldn't be painless. So here is a list of dos and don'ts for writing new code.
For new projects, start with Python 2.6 or 2.7. They provide a lot of things that make an upgrade to Python 3 easier for you. If you don't have to support older versions of Python you can already use a lot of the stuff that is in Python 3 by explicitly opting into it.
You should use the following things from __future__:
division. I must admit that I hate the future division import in Python 2. It constantly makes me jump to the beginning of the file to check which division mode is active for a module when I do code review. However, because it will be the default in Python 3 (and the only mode), you really should be using it nowadays.
absolute_import. The most important one. No longer will the statement from xml import bar pick up the module foo.xml from the foo package when you are inside it. Instead you explicitly have to write from .xml import bar to get what you want. Less confusion and incredibly helpful.
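As a minimal sketch of what opting in looks like, a module header along these lines should behave the same on 2.6, 2.7 and 3.x. The variable names are only for illustration:

```python
# Opt in to the Python 3 behaviour on 2.6/2.7; on Python 3 this line is a no-op.
from __future__ import division, absolute_import

half = 7 / 2       # true division, gives 3.5 everywhere
floored = 7 // 2   # floor division has to be asked for explicitly, gives 3
```

Code that relied on the old integer division has to switch to the // operator, which is exactly the change Python 3 would force on it anyway.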
Regarding the print-as-a-function future import, I recommend against using it to avoid confusion. Especially because all editors currently highlight print as a keyword, it can become confusing quickly. Generally, if things behave differently in different files it's a good idea to avoid them if possible. The great aspect of the print change is that it can be reliably converted with 2to3, so there is really no reason to use the print_function future import.
While it might be appealing, better not use the unicode_literals future import, for the very simple reason that many APIs are changing the supported string types in different places and unicode_literals is counterproductive there. There are of course places where this future import is useful, but that's more limited to lower-level interfaces (libraries), and those can't use that import anytime soon anyway because it came with Python 2.6. To get access to the b'foo' literal you do not need this specific import. That is available either way and is a great help for 2to3.
File IO changed greatly in Python 3. Thankfully if you are designing new APIs for new projects you can save yourself a lot of hassle by deciding explicitly for unicode.
If you are dealing with text data, use the codecs.open function for opening the files. Assume utf-8 encoding unless explicitly defined otherwise and operate on unicode strings only. For binary IO make sure to open the file with 'rb' instead of 'r' and you are set. That was already required for proper Windows support anyway.
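Here is a small sketch of that split between text and binary IO. The helper names (write_text, read_text, read_binary) and the temporary file are made up for this example. On 2.6 and later, io.open with an encoding argument is an alternative to codecs.open.

```python
import codecs
import os
import tempfile


def write_text(path, text, encoding='utf-8'):
    # Text goes in as unicode; codecs.open encodes it on the way out.
    with codecs.open(path, 'w', encoding=encoding) as f:
        f.write(text)


def read_text(path, encoding='utf-8'):
    # Returns a unicode string on Python 2 and a str on Python 3.
    with codecs.open(path, 'r', encoding=encoding) as f:
        return f.read()


def read_binary(path):
    # 'rb' keeps the data as raw bytes on every platform.
    with open(path, 'rb') as f:
        return f.read()


path = os.path.join(tempfile.mkdtemp(), 'example.txt')
write_text(path, u'caf\xe9')
text = read_text(path)     # the unicode string we wrote
raw = read_binary(path)    # the same content as utf-8 encoded bytes
```

The point of the design is that the encoding is named exactly once, at the file boundary, and everything past that boundary only ever sees unicode.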
If you are doing byte based data processing, mark strings that are bytes only with b'foo' instead of 'foo', which tells 2to3 to not convert these string literals to unicode. Please be aware of the following differences between Python 2.6:
>>> b'foo'
'foo'
>>> b'foo'[0]
'f'
>>> b'foo' + u'bar'
u'foobar'
>>> list(b'foo')
['f', 'o', 'o']
and Python 3 regarding byte strings:
>>> b'foo'[0]
102
>>> b'foo' + 'bar'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
>>> list(b'foo')
[102, 111, 111]
As a replacement for the above Python 2 idioms, you can use this instead:
>>> b'foo'[0:0 + 1]
b'f'
>>> b'foo' + 'bar'.encode('latin1')
b'foobar'
>>> to_charlist = lambda x: [x[c:c + 1] for c in range(len(x))]
>>> to_charlist(b'foo')
[b'f', b'o', b'o']
These will work on both 2.6 and 3.x.
There are a couple of things where 2to3 will be pretty counterproductive. Some of these are cases where 2to3 seems to have a bug, others are the cases where it just does not know enough of your code to make proper predictions.
A lot of people are using code like this on Python 2:
class Foo(object):
    def __str__(self):
        return unicode(self).encode('utf-8')
    def __unicode__(self):
        return u'Hello World'
2to3 assumes that your API is not unicode compatible and will convert it to this:
class Foo(object):
    def __str__(self):
        return str(self).encode('utf-8')
    def __unicode__(self):
        return 'Hello World'
Now this is just wrong. First of all, __unicode__ is unused in Python 3; secondly, __str__ now calls into itself and will trigger a runtime error because of recursion when str() is called on an instance of Foo. This can be solved with either a custom 2to3 fixer or a little helper class that makes a check for Python 3:
import sys

class UnicodeMixin(object):
    if sys.version_info > (3, 0):
        __str__ = lambda x: x.__unicode__()
    else:
        __str__ = lambda x: unicode(x).encode('utf-8')

class Foo(UnicodeMixin):
    def __unicode__(self):
        return u'Hello World'
That way your object will still have an __unicode__ attribute on Python 3, but that will not do any harm. When you then want to drop Python 2 support you just have to go over all subclasses of UnicodeMixin, rename __unicode__ to __str__ and remove the helper class.
This problem is a little more tricky. In Python 2 the following is true:
>>> 'foo' == u'foo'
True
Not so in Python 3:
>>> b'foo' == 'foo'
False
What's worse here is that Python 2 does not emit a warning on such comparisons (neither with nor without the Python-3-warnings flag) and neither will Python 3 by default (it only does so when started with the -b flag). So how can you spot these cases? I wrote a small helper module called unicode-nazi which, once imported, will warn automatically if you do something that is not purely a unicode or bytestring operation:
>>> import unicodenazi
>>> u'foo' == 'foo'
__main__:1: UnicodeWarning: Implicit conversion of str to unicode
True
But be aware that this module is very noisy and has a noticeable runtime overhead.
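Once such a comparison has been spotted, one way to fix it is to decode explicitly before comparing. The following is only a sketch: to_text and text_equal are hypothetical helpers, not part of any library, and utf-8 is an assumed default.

```python
def to_text(value, encoding='utf-8'):
    # bytes is an alias for str on Python 2.6+, so this also catches
    # plain Python 2 bytestrings; unicode values pass through untouched.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def text_equal(a, b, encoding='utf-8'):
    # Compare two strings only after both sides are unicode.
    return to_text(a, encoding) == to_text(b, encoding)
```

With that, text_equal(b'foo', u'foo') gives the same answer on Python 2 and Python 3 instead of silently flipping from True to False.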
Here is a table of things that are bytestrings and what they usually become in Python 3:
| Type | Type in Python 3 (unicode == str) |
|------|-----------------------------------|
| identifiers | unicode |
| Docstrings | unicode |
| __repr__ | unicode |
| string keys of dictionaries | unicode |
| WSGI environment keys | unicode |
| HTTP header values, WSGI environment values | unicode, limited to ASCII in 3.1 and limited to latin1 in 3.2 |
| URLs | unicode, but some APIs also accept byte strings. Special attention: your URLs have to be encoded in UTF-8 in order to use all of the standard library functions. |
| Filenames | unicode or bytes. Most APIs accept both but implicit conversions are not supported. |
| Binary contents | bytes or bytearray. Beware: the second type is mutable, so be aware of the fact that you can have a string-ish object that is mutable. |
| Python code | unicode. You have to decode the source yourself when you pass it over to exec. |
In some places (WSGI for instance) there is now the notion of unicode strings that must only be a subset of latin1. That's the case because the HTTP spec is not very clear on encodings and it was decided to assume latin1 to be safe. If you control both ends of the communication (like you do with cookies) you can of course use utf-8 if you like. So how does this work if the header is limited to latin1? For Python 3 (and only for Python 3) you will need to apply a little trick:
return cookie_value.encode('utf-8').decode('latin1')
That way you just smuggle utf-8 encoded data into a unicode string. The WSGI layer will then encode this string as latin1 again, so what you are transmitting over the wire is utf-8 mislabeled as latin1. If you do the inverse of that trick on the receiving end it will work.
That's of course ugly, but that's pretty much how utf-8 in headers already worked. And it's really just the cookie header that is affected by that, and that header was unreliable anyways.
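To make the trick and its inverse concrete, here is a minimal sketch for Python 3. The function names are invented for this example and are not part of WSGI or any framework:

```python
def utf8_to_wsgi(value):
    # Encode as utf-8, then pretend the bytes are latin1 so the result
    # only contains code points the WSGI layer can encode again.
    return value.encode('utf-8').decode('latin1')


def wsgi_to_utf8(value):
    # Inverse of the above: recover the raw bytes, then decode them properly.
    return value.encode('latin1').decode('utf-8')
```

Because latin1 maps every byte to exactly one code point, the round trip is lossless as long as both ends agree that the payload is really utf-8.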
The only other place in WSGI where this will become an issue is the PATH_INFO / SCRIPT_NAME tuple, but your framework should figure that out for you when it's working on Python 3.