written on February 11, 2010
The latest Jinja 2 release came with basic support for Python 3. It was surprisingly painless to port the application over but it did require a substantial amount of tweaks and code changes in order to get it running. For everyone else out there who is interested in getting started, I decided to share my experiences:
Before you start porting the library you have to decide how interfaces will behave in Python 3. The biggest issue here is obviously unicode, but there are others as well. I would say there are four kinds of libraries you might encounter regarding string behavior in Python 2: There are the libraries that only accept unicode and only output unicode, there are those that only accept byte-strings and output byte-strings but operate on textual data, there are the libraries that operate on either or and what has been fed into it, comes out of it and there are libraries that operate either on unicode or byte-strings and also accept the other type as long as it's a subset of the default encoding (ASCII).
First you have to find out what your library does, what it is supposed
to do, and how you want to deal with that in Python 3. Because
byte-strings no longer exist in Python 3 and were replaced by a bytes
object that works similar, but has an incompatible API it is very
unlikely that your code will be able to support both in the future (or
that it is something you would desire).
This is might the most tricky one if you are aiming for Python 2.5
support or lower and you are operating on bytes directly. The issue is
that the way you operate on bytes changed fundamentally from Python 2.x
to 3.x and 2to3 is not really able to pick it up. Worse, it will try
convert all your bytestring literals to unicode! The official support is
as far as I know, to explicitly prefix the byte strings in the 2.x code
with a leading b
to indicate bytes. Unfortunately that means no
support for 2.x. I am not completely sure what to do in that situation,
but at least I found a way to trick python to operate on bytes: if you
have code like this:
magic = 'M23\x01'
And you want to ensure it does not end up being a str
in 3.x, add a
dummy encode:
magic = u'M23\x01'.encode('iso-8859-1')
The only downside is that the encode happens at runtime, so it will slow down execution a bit.
The second kind of library is a library that operates on text. In 2.x
there were multiple ways to implement such libraries and it basically
came down to what data type was used internally and what was accepted
for input and output. There are the libraries that operate exclusively
either on bytestrings or unicode. These are the ones that are the
easiest to port, because 2to3 was written with nearly that in mind. If
your library was only accepting bytestrings in 2.x it will (after a 2to3
run) only be accepting a Python 3 str
type which is unicode based.
This works well as long as you do not intend to use some kind of IO in
your library. Once you start doing that, you will need to make sure you
can somehow specify the encoding to be used when opening files. In that
case, make sure you open the file in byte mode (not in text mode!) and
do the decoding/encoding yourself. This is the only way your IO code
will work the same in both 2.x and 3.x. But more on IO later.
What 2to3 does out of the box is converting calls from unicode
to
str
automatically. Unfortunately it does not change the special
__unicode__
method to __str__
. You can easily do that in a custom
fixer though, so it should be easy to accomplish. If your library
however supports both __str__
and __unicode__
you are in a more
tricky situation here. Let me show you an example of the kind of
classes I deal with in Jinja 2 for example:
class MyObject(object):
def __init__(self):
self.value = u'some value'
def __str__(self):
return unicode(self).encode('utf-8')
def __unicode__(self):
return self.value
The big problem here is that 2to3 will convert it to this:
class MyObject(object):
def __init__(self):
self.value = 'some value'
def __str__(self):
return str(self).encode('utf-8')
def __unicode__(self):
return self.value
If you call str()
on your instance now, it will die with a runtime
error because it recurses infinitely. Even if it would not recurse, it
would try to return a bytes object from the __str__
method because of
the encode call. My plan was to write a custom fixer that, if it detects
a __str__
that just calls into __unicode__
and encodes, will drop
the __str__
method and rename __unicode__
to __str__
.
Unfortunately the tree you are dealing with in 2to3 does not appear to
be designed to removing code so what I do instead of removing the
__str__
is just renaming the __unicode__
to __str__
and let Python
override the dummy __str__
with the correct one. The fixer I use for
that, looks like this:
from lib2to3 import fixer_base
from lib2to3.fixer_util import Name
class FixRenameUnicode(fixer_base.BaseFix):
PATTERN = r"funcdef< 'def' name='__unicode__' parameters< '(' NAME ')' > any+ >"
def transform(self, node, results):
name = results['name']
name.replace(Name('__str__', prefix=name.prefix))
After conversion with this fixer in place, the class from above will then look like this:
class MyObject(object):
def __init__(self):
self.value = 'some value'
def __str__(self):
return str(self).encode('utf-8')
def __str__(self):
return self.value
But where to put those fixers? Edit 2to3 directly? And do I have to provide two source packages for 2.x and 3.x? This is where distribute comes in.
Distutils itself already has the possibility to run 2to3 for you, but
what it cannot do is adding custom fixers without a lot of custom code.
distribute on the other hand not gives you built in 2to3 support as a
single keyword argument to setup()
but can also pass custom fixers to
2to3 which is very helpful. Because these new keyword arguments however
would warn if the setup script was executed with setuptools instead of
distribute, you should only pass them to the setup function if invoked
from Python 3. The setup script then looks like this:
import sys
from setuptools import setup
# if we are running on python 3, enable 2to3 and
# let it use the custom fixers from the custom_fixers
# package.
extra = {}
if sys.version_info >= (3, 0):
extra.update(
use_2to3=True,
use_2to3_fixers=['custom_fixers']
)
setup(
name='Your Library',
version='1.0',
classifiers=[
# make sure to use :: Python *and* :: Python :: 3 so
# that pypi can list the package on the python 3 page
'Programming Language :: Python',
'Programming Language :: Python :: 3'
],
packages=['yourlibrary'],
# make sure to add custom_fixers to the MANIFEST.in
include_package_data=True,
**extra
)
Now all you have to do is to put the custom 2to3 fixers (written in
Python 3!) into the custom_fixers
package next to your real library
and they will be added automatically. For examples of fixers, look into
the lib2to3/fixes
package or your Python 3 installation. If you run
python3 setup.py build
it will run 2to3 on your files and put the
output into the build folder for you to test.
So in Python 3 there is a completely new input/output system. It is very Java-ish and is able to deal with unicode. The downside is that you either don't have it in 2.x or the implementation is too slow, so what you want to do is to create yourself an abstraction layer.
If your library was unicode based in older Python versions you probably
just did file.read().decode(encoding)
or something similar. This still
works on 3.x and I strongly recommend doing that, but be sure to open
the file in binary mode, otherwise on Python 3 the decode will attempt
to decode an already decoded unicode string, which does not make any
sense. If you need normalized newlines (windows newlines converted to
'