Armin Ronacher's Thoughts and Writings

Porting to Python 3 Redux

written on May 21, 2013

After a very painful porting experience with Jinja2 to Python 3 I basically left the project idling around for a while because I was too afraid of breaking Python 3 support. The approach I used was one codebase that was written in Python 2 and translated with 2to3 to Python 3 on installation. The unfortunate side-effect of that is that any change you do requires about a minute of translation which destroys all your iteration speeds. Thankfully it turns out that if you target the right versions of Python you can do much better.

Thomas Waldmann from the MoinMoin project started running Jinja2 through my python-modernize with the right parameters and ended up with a unified codebase that runs on 2.6, 2.7 and 3.3. With a bit of cleanup afterwards we were able to come up with a nice codebase that runs in all versions of Python and also looks like regular Python code for the most time.

Motivated by the results from this I went through the code a few more times and also started migrating some other code over to experiment more with unified codebases.

This is a selection of some tips and tricks I can now share in regards to accomplishing something similar.

Drop 2.5, 3.1 and 3.2

This is the most important tip. Dropping 2.5 by now is more than possible since very few people are still on it and dropping 3.1 and 3.2 are no brainers anyways considering the low adoption of Python 3 so far. But why would you drop those versions? The basic answer is that 2.6 and 3.3 have a lot of overlapping syntax and features that allow for code that works well with both:

This is particularly useful if you did stream based encoding and decoding. That functionality was plain missing between 3.0 up until 3.3.

Yes, the six module can get you quite far, but don't underestimate the impact of looking at nice code. With the Python 3 port I basically lost interest in maintaining Jinja2 because the codebase started to frustrate me. Back then a unified codebase was looking ugly and had a performance impact (six.b('foo') and six.u('foo') everywhere) or was plagued under the bad iteration speeds of 2to3. Not having to deal with any of that brings the joy back. Jinja2's codebase now looks like very clean and you have to find the Python 2/3 compatibility support. Very few paths in the code actually do something like if PY2:.

The rest of this article assumes that these are the Python versions you want to support. Also attempting to support Python 2.5 is a very painful thing to do and I strongly recommend against it. Supporting 3.2 is possible if you are willing to wrap all your strings in function calls which I don't recommend doing for aesthetic and performance reasons.

Skip Six

Six is a pretty neat library and this is where the Jinja2 port started out with. There however at the end of the day there is not much in six that actually is required for getting a Python 3 port running and a few things are missing. Six is definitely required if you want to support Python 2.5 but from 2.6 or later there really is not much of a reason to use six. Jinja2 ships a _compat module that contains the few helpers required. Including a few lines of non Python 3 code the whole compatibility module is less than 80 lines of code.

This saves you from the troubles where users might expect a different version of six because of another library or pulling in another dependency into your project.

Start with Modernize

To start with the port the python-modernize library is a good start. It is a version of 2to3 that produces code that runs in either. While it's pretty buggy still and the default options are not particularly great, it can get you quite far with regards to doing the boring parts for you. You will still need to go over the result and clean up some imports and ugly results.

Fix your Tests

Before you do anything else walk through your tests and make sure that they still make sense. A lot of the problems in the Python standard library in 3.0 and 3.1 came from the fact that the behavior of the testsuite changed through the conversion to Python 3 in unintended ways.

Write a Compatibility Module

So you're going to skip six, can you live without helpers? The answer is “no”. You will still need a small compatibility module but that is small enough that you can just keep it in your package. Here are some basic examples of what such a compatibility module could look like:

import sys
PY2 = sys.version_info[0] == 2
if not PY2:
    text_type = str
    string_types = (str,)
    unichr = chr
else:
    text_type = unicode
    string_types = (str, unicode)
    unichr = unichr

The exact contents of that module will depend on how much actually changed for you. In case of Jinja2 I put a whole bunch of functions in there. For instance it contains ifilter, imap and similar itertools functions that became builtins in 3.x. (I stuck with the Python 2.x functions to make it clear for the reader of the code that the iterator behavior is intended and not a bug).

Test for 2.x not 3.x

At one point there will be the requirement to check if you are executing on 2.x or 3.x. In that cases I would recommend checking for Python 2 first and putting Python 3 into your else branch instead of the other way round. That way you will have less ugly surprises when a Python 4 comes around at one point.

Good:

if PY2:
    def __str__(self):
        return self.__unicode__().encode('utf-8')

Less ideal:

if not PY3:
    def __str__(self):
        return self.__unicode__().encode('utf-8')

String Handling

The biggest change in Python 3 is without doubt the changes on the Unicode interface. Unfortunately these changes are very painful in some places and also inconsistently handled throughout the standard library. The majority of the time porting will clearly be wasted on this topic. This topic is a whole article by itself but here is a quick cheat sheet for porting that Jinja2 and Werkzeug follow:

This property is very useful for unified codebases because the general trend with Python 3 is to introduce Unicode in some interfaces that previously did not support it, but never the inverse. Since native string literals “upgrade” to Unicode but still somewhat support Unicode in 2.x this string literal is very flexible.

For instance the datetime.strftime function strictly does not support Unicode in Python 2 but is Unicode only in 3.x. Because in most cases the return value on 2.x however was ASCII only things like this work really well in 2.x and 3.x:

>>> u'<p>Current time: %s' % datetime.datetime.utcnow().strftime('%H:%M')
u'<p>Current time: 23:52'

The string passed to strftime is native (so bytes in 2.x and Unicode in 3.x). The return value is a native string again and ASCII only. As such both on 2.x and 3.x it will be a Unicode string once string formatted.

>>> bytearray(b' foo ').strip()
bytearray(b'foo')

Since it's also mutable it's quite efficient at modifying raw bytes and you can trivially convert it to something more conventional by wrapping the final result in bytes() again.

In addition to these basic rules I also add text_type, unichr and string_types variables to my compatibility module as shown above. With those available the big changes are:

I also created a implements_to_string class decorator that helps implementing classes with __unicode__ or __str__ methods:

if PY2:
    def implements_to_string(cls):
        cls.__unicode__ = cls.__str__
        cls.__str__ = lambda x: x.__unicode__().encode('utf-8')
        return cls
else:
    implements_to_string = lambda x: x

The idea is that you just implement __str__ on both 2.x and 3.x and let it return Unicode strings (yes, looks a bit odd in 2.x) and the decorator automatically renames it to __unicode__ for 2.x and adds a __str__ that invokes __unicode__ and encodes the return value to utf-8. This pattern has been pretty common in the past with 2.x modules. For instance Jinja2 and Django use it.

Here is an example for the usage:

@implements_to_string
class User(object):
    def __init__(self, username):
        self.username = username
    def __str__(self):
        return self.username

Metaclass Syntax Changes

Since Python 3 changed the syntax for defining the metaclass to use in an incompatible way this makes porting a bit harder than it should be. Six has a with_metaclass function that can work around this issue but it generates a dummy class that shows up in the inheritance tree. For Jinja2 I was not happy enough with that solution and modified it a bit. The external API is the same but the implementation uses a temporary class to hook in the metaclass. The benefit is that you don't have to pay a performance penalty for using it and your inheritance tree stays nice.

The code is a bit hard to understand. The basic idea is exploiting the idea that metaclasses can customize class creation and are picked by by the parent class. This particular implementation uses a metaclass to remove its own parent from the inheritance tree on subclassing. The end result is that the function creates a dummy class with a dummy metaclass. Once subclassed the dummy classes metaclass is used which has a constructor that basically instances a new class from the original parent and the actually intended metaclass. That way the dummy class and dummy metaclass never show up.

This is what it looks like:

def with_metaclass(meta, *bases):
    class metaclass(meta):
        __call__ = type.__call__
        __init__ = type.__init__
        def __new__(cls, name, this_bases, d):
            if this_bases is None:
                return type.__new__(cls, name, (), d)
            return meta(name, bases, d)
    return metaclass('temporary_class', None, {})

And here is how you use it:

class BaseForm(object):
    pass

class FormType(type):
    pass

class Form(with_metaclass(FormType, BaseForm)):
    pass

Dictionaries

One of the more annoying changes in Python 3 are the changes on the dictionary iterator protocols. In Python 2 all dictionaries had keys(), values() and items() that returned lists and iterkeys(), itervalues() and iteritems() that returned iterators. In Python 3 none of that exists any more. Instead they were replaced with new methods that return view objects.

keys() returns a key view which behaves like some sort of read-only set, values() which returns a read-only container and iterable (not an iterator!) and items() which returns some sort of read-only set-like object. Unlike regular sets it however can also point to mutable objects in which case some methods will fail at runtime.

On the positive side a lot of people missed that views are not iterators so in many cases you can just ignore that. Werkzeug and Django implement a bunch of custom dictionary objects and in both cases the decision was made to just ignore the existence of view objects and let keys() and friends return iterators.

This is currently the only sensible thing to do due to limitations of the Python interpreter. There are a few problems with it:

This is what the Jinja2 codebase went with for iterating over dictionaries:

if PY2:
    iterkeys = lambda d: d.iterkeys()
    itervalues = lambda d: d.itervalues()
    iteritems = lambda d: d.iteritems()
else:
    iterkeys = lambda d: iter(d.keys())
    itervalues = lambda d: iter(d.values())
    iteritems = lambda d: iter(d.items())

For implementing dictionary like objects a class decorator can become useful again:

if PY2:
    def implements_dict_iteration(cls):
        cls.iterkeys = cls.keys
        cls.itervalues = cls.values
        cls.iteritems = cls.items
        cls.keys = lambda x: list(x.iterkeys())
        cls.values = lambda x: list(x.itervalues())
        cls.items = lambda x: list(x.iteritems())
        return cls
else:
    implements_dict_iteration = lambda x: x

In that case all you need to do is to implement the keys() and friends method as iterators and the rest happens automatically:

@implements_dict_iteration
class MyDict(object):
    ...

    def keys(self):
        for key, value in iteritems(self):
            yield key

    def values(self):
        for key, value in iteritems(self):
            yield value

    def items(self):
        ...

General Iterator Changes

Since iterators changed in general a bit of help is needed to make this painless. The only change really is the transition from next() to __next__. Thankfully this is already transparently handled. The only thing you really need to change is to go from x.next() to next(x) and the language does the rest.

If you plan on defining iterators a class decorator again becomes helpful:

if PY2:
    def implements_iterator(cls):
        cls.next = cls.__next__
        del cls.__next__
        return cls
else:
    implements_iterator = lambda x: x

For implementing this class just name the iteration step method __next__ in all versions:

@implements_iterator
class UppercasingIterator(object):
    def __init__(self, iterable):
        self._iter = iter(iterable)
    def __iter__(self):
        return self
    def __next__(self):
        return next(self._iter).upper()

Transformation Codecs

One of the nice features of the Python 2 encoding protocol was that it was independent of types. You could register an encoding that would transform a csv file into a numpy array if you would have preferred that. This feature however was not well known since the primary exposed interface of encodings was attached to string objects. Since they got stricter in 3.x a lot of that functionality was removed in 3.0 but later reintroduced in 3.3 again since it proved useful. Basically all codecs that did not convert between Unicode and bytes or the other way round were unavailable until 3.3. Among those codecs are the hex and base64 codec.

There are two use cases for those codecs: operations on strings and operations on streams. The former is well known as str.encode() in 2.x but now looks different if you want to support 2.x and 3.x due to the changed string API:

>>> import codecs
>>> codecs.encode(b'Hey!', 'base64_codec')
'SGV5IQ==\n'

You will also notice that the codecs are missing the aliases in 3.3 which requires you to write 'base64_codec' instead of 'base64'.

(These codecs are preferable over the functions in the binascii module because they support operations on streams through the incremental encoding and decoding support.)

Other Notes

There are still a few places where I don't have nice solutions for yet or are generally annoying to deal with but they are getting fewer. Some of them are unfortunately now part of the Python 3 API are hard to discover until you trigger an edge case.

Generally I noticed the most reliable way to do text on Python 3 that also works okay on 2.x is just to open files in binary mode and explicitly decode. Alternatively you can use the codecs.open or io.open function on 2.x and the builtin open on Python 3 with an explicit encoding.

if PY2:
    exec('def reraise(tp, value, tb):\n raise tp, value, tb')
else:
    def reraise(tp, value, tb):
        raise value.with_traceback(tb)
exec_ = lambda s, *a: eval(compile(s, '<string>', 'exec'), *a)

Outlook

Unified codebases for 2.x and 3.x are definitely within reach now. The majority of the porting time will still be spend trying to figure out how APIs are going to behave with regards to Unicode and interoperability with other modules that might have changed their API. In any case if you want to consider porting libraries don't bother with versions outside below 2.5, 3.0-3.2 and it will not hurt as much.

This entry was tagged python and thoughts